2025-12-04T08:55:48.2851235Z Current runner version: '2.330.0' 2025-12-04T08:55:48.2857449Z Runner name: 'i-004b897e3464f6ebc' 2025-12-04T08:55:48.2858241Z Runner group name: 'Default' 2025-12-04T08:55:48.2859262Z Machine name: 'ip-10-0-34-254' 2025-12-04T08:55:48.2862159Z ##[group]GITHUB_TOKEN Permissions 2025-12-04T08:55:48.2864410Z Contents: read 2025-12-04T08:55:48.2865151Z Metadata: read 2025-12-04T08:55:48.2865642Z ##[endgroup] 2025-12-04T08:55:48.2867703Z Secret source: Actions 2025-12-04T08:55:48.2868612Z Prepare workflow directory 2025-12-04T08:55:48.3343683Z Prepare all required actions 2025-12-04T08:55:48.3377298Z Getting action download info 2025-12-04T08:55:48.6704417Z Download action repository 'pytorch/test-infra@main' (SHA:39aa74d619174326f4e2fb0e216151c2f29d9ffd) 2025-12-04T08:55:51.3055353Z Download action repository 'pytorch/pytorch@main' (SHA:eabb7ad2128580ef674446027b95bcf4e21e8df3) 2025-12-04T08:56:08.0423006Z Download action repository 'actions/setup-python@a26af69be951a213d495a4c3e4e4022e16d87065' (SHA:a26af69be951a213d495a4c3e4e4022e16d87065) 2025-12-04T08:56:08.3918835Z Download action repository 'aws-actions/configure-aws-credentials@ececac1a45f3b08a01d2dd070d28d111c5fe6722' (SHA:ececac1a45f3b08a01d2dd070d28d111c5fe6722) 2025-12-04T08:56:08.6499710Z Download action repository 'aws-actions/amazon-ecr-login@062b18b96a7aff071d4dc91bc00c4c1a7945b076' (SHA:062b18b96a7aff071d4dc91bc00c4c1a7945b076) 2025-12-04T08:56:08.9213061Z Download action repository 'seemethere/download-artifact-s3@1da556a7aa0a088e3153970611f6c432d58e80e6' (SHA:1da556a7aa0a088e3153970611f6c432d58e80e6) 2025-12-04T08:56:09.1991640Z Download action repository 'seemethere/upload-artifact-s3@baba72d0712b404f646cebe0730933554ebce96a' (SHA:baba72d0712b404f646cebe0730933554ebce96a) 2025-12-04T08:56:09.4943233Z Getting action download info 2025-12-04T08:56:09.6608001Z Download action repository 'actions/checkout@v4' (SHA:34e114876b0b11c390a56381ad16ebd13914f8d5) 2025-12-04T08:56:10.0137840Z Getting action download info 2025-12-04T08:56:10.1570588Z Download action repository 'nick-fields/retry@v3.0.0' (SHA:7152eba30c6575329ac0576536151aca5a72780e) 2025-12-04T08:56:10.4101124Z Getting action download info 2025-12-04T08:56:10.5303070Z Download action repository 'nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482' (SHA:3e91a01664abd3c5cd539100d10d33b9c5b68482) 2025-12-04T08:56:10.9284184Z Getting action download info 2025-12-04T08:56:11.1371539Z Uses: pytorch/pytorch/.github/workflows/_linux-test.yml@refs/heads/main (ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32) 2025-12-04T08:56:11.1375244Z ##[group] Inputs 2025-12-04T08:56:11.1375592Z build-environment: linux-jammy-cuda12.8-py3.10-gcc11 2025-12-04T08:56:11.1382784Z test-matrix: {"include": [{"config": "default", "shard": 1, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 1, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 2, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 2, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 3, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 3, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 4, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 4, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 5, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 5, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "distributed", "shard": 1, "num_shards": 3, "runner": "lf.linux.g4dn.12xlarge.nvidia.gpu", "mem_leak_check": "mem_leak_check"}, {"config": "distributed", "shard": 1, "num_shards": 3, "runner": "lf.linux.g4dn.12xlarge.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "distributed", "shard": 2, "num_shards": 3, "runner": "lf.linux.g4dn.12xlarge.nvidia.gpu", "mem_leak_check": "mem_leak_check"}, {"config": "distributed", "shard": 2, "num_shards": 3, "runner": "lf.linux.g4dn.12xlarge.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "distributed", "shard": 3, "num_shards": 3, "runner": "lf.linux.g4dn.12xlarge.nvidia.gpu", "mem_leak_check": "mem_leak_check"}, {"config": "distributed", "shard": 3, "num_shards": 3, "runner": "lf.linux.g4dn.12xlarge.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "pr_time_benchmarks", "shard": 1, "num_shards": 1, "runner": "linux.g4dn.metal.nvidia.gpu", "mem_leak_check": "mem_leak_check"}, {"config": "pr_time_benchmarks", "shard": 1, "num_shards": 1, "runner": "linux.g4dn.metal.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "libtorch_agnostic_targetting", "shard": 1, "num_shards": 1, "runner": "linux.g4dn.metal.nvidia.gpu", "mem_leak_check": "mem_leak_check"}, {"config": "libtorch_agnostic_targetting", "shard": 1, "num_shards": 1, "runner": "linux.g4dn.metal.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests"}]} 2025-12-04T08:56:11.1390659Z docker-image: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T08:56:11.1391378Z sync-tag: 2025-12-04T08:56:11.1392133Z timeout-minutes: 360 2025-12-04T08:56:11.1392516Z use-gha: 2025-12-04T08:56:11.1392799Z dashboard-tag: 2025-12-04T08:56:11.1393125Z s3-bucket: gha-artifacts 2025-12-04T08:56:11.1393409Z aws-role-to-assume: 2025-12-04T08:56:11.1394058Z disable-monitor: false 2025-12-04T08:56:11.1394460Z monitor-log-interval: 5 2025-12-04T08:56:11.1394791Z monitor-data-collect-interval: 1 2025-12-04T08:56:11.1395117Z ##[endgroup] 2025-12-04T08:56:11.1395706Z Complete job name: linux-jammy-cuda12.8-py3.10-gcc11 / test (default, 1, 5, lf.linux.g6.4xlarge.experimental.nvidia.gpu, rerun_disabled_tests) 2025-12-04T08:56:11.1996800Z A job started hook has been configured by the self-hosted runner administrator 2025-12-04T08:56:11.2089877Z ##[group]Run '/home/ec2-user/runner-scripts/before_job.sh' 2025-12-04T08:56:11.2100088Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T08:56:11.2100627Z ##[endgroup] 2025-12-04T08:56:12.6416309Z Runner Type: lf.linux.g6.4xlarge.experimental.nvidia.gpu 2025-12-04T08:56:12.6416792Z Instance Type: g6.4xlarge 2025-12-04T08:56:12.6417016Z AMI Name: unknown 2025-12-04T08:56:12.6451099Z AMI ID: ami-08982f1c5bf93d976 2025-12-04T08:56:17.3776019Z ##[group]Run pytorch/test-infra/.github/actions/setup-ssh@main 2025-12-04T08:56:17.3776380Z with: 2025-12-04T08:56:17.3776892Z github-secret: *** 2025-12-04T08:56:17.3777414Z instructions: All testing is done inside the container, to start an interactive session run: docker exec -it $(docker container ps --format '{{.ID}}') bash 2025-12-04T08:56:17.3777964Z activate-with-label: false 2025-12-04T08:56:17.3778165Z label: with-ssh 2025-12-04T08:56:17.3778337Z remove-existing-keys: true 2025-12-04T08:56:17.3778535Z fail-silently: true 2025-12-04T08:56:17.3778740Z env: 2025-12-04T08:56:17.3778884Z GIT_DEFAULT_BRANCH: main 2025-12-04T08:56:17.3779092Z ##[endgroup] 2025-12-04T08:56:17.4843472Z Please see https://github.com/pytorch/pytorch/wiki/Debugging-using-with-ssh-for-Github-Actions for more info. 2025-12-04T08:56:17.4845013Z Not on pull request and ciflow reference could not be extracted, skipping adding ssh keys 2025-12-04T08:56:17.4974170Z ##[group]Run pytorch/pytorch/.github/actions/checkout-pytorch@main 2025-12-04T08:56:17.4974738Z with: 2025-12-04T08:56:17.4974904Z no-sudo: true 2025-12-04T08:56:17.4975094Z submodules: recursive 2025-12-04T08:56:17.4975285Z fetch-depth: 0 2025-12-04T08:56:17.4975451Z env: 2025-12-04T08:56:17.4975611Z GIT_DEFAULT_BRANCH: main 2025-12-04T08:56:17.4975801Z ##[endgroup] 2025-12-04T08:56:17.5039156Z ##[group]Run echo "IN_CONTAINER_RUNNER=$(if [ -f /.inarc ] || [ -f /.incontainer ]; then echo true ; else echo false; fi)" >> "$GITHUB_OUTPUT" 2025-12-04T08:56:17.5039867Z echo "IN_CONTAINER_RUNNER=$(if [ -f /.inarc ] || [ -f /.incontainer ]; then echo true ; else echo false; fi)" >> "$GITHUB_OUTPUT" 2025-12-04T08:56:17.5050442Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T08:56:17.5050734Z env: 2025-12-04T08:56:17.5050913Z GIT_DEFAULT_BRANCH: main 2025-12-04T08:56:17.5051149Z ##[endgroup] 2025-12-04T08:56:17.5150789Z ##[group]Run # Use all available CPUs for fetching 2025-12-04T08:56:17.5151152Z # Use all available CPUs for fetching 2025-12-04T08:56:17.5151424Z cd "${GITHUB_WORKSPACE}" 2025-12-04T08:56:17.5151686Z git config --global fetch.parallel 0 2025-12-04T08:56:17.5151987Z git config --global submodule.fetchJobs 0 2025-12-04T08:56:17.5152250Z  2025-12-04T08:56:17.5152527Z # Clean workspace. The default checkout action should also do this, but 2025-12-04T08:56:17.5152883Z # do it here as well just in case 2025-12-04T08:56:17.5153141Z if [[ -d .git ]]; then 2025-12-04T08:56:17.5153368Z  if [ -z "${NO_SUDO}" ]; then 2025-12-04T08:56:17.5153674Z  sudo git clean -ffdx 2025-12-04T08:56:17.5153929Z  else 2025-12-04T08:56:17.5154175Z  git clean -ffdx 2025-12-04T08:56:17.5154381Z  fi 2025-12-04T08:56:17.5154560Z fi 2025-12-04T08:56:17.5161502Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T08:56:17.5162059Z env: 2025-12-04T08:56:17.5162295Z GIT_DEFAULT_BRANCH: main 2025-12-04T08:56:17.5162516Z NO_SUDO: true 2025-12-04T08:56:17.5162709Z ##[endgroup] 2025-12-04T08:56:17.5273839Z ##[group]Run actions/checkout@v4 2025-12-04T08:56:17.5274051Z with: 2025-12-04T08:56:17.5274248Z ref: ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T08:56:17.5274492Z fetch-depth: 0 2025-12-04T08:56:17.5274663Z submodules: recursive 2025-12-04T08:56:17.5274854Z show-progress: false 2025-12-04T08:56:17.5275056Z repository: pytorch/pytorch 2025-12-04T08:56:17.5275395Z token: *** 2025-12-04T08:56:17.5275557Z ssh-strict: true 2025-12-04T08:56:17.5275731Z ssh-user: git 2025-12-04T08:56:17.5275901Z persist-credentials: true 2025-12-04T08:56:17.5276102Z clean: true 2025-12-04T08:56:17.5276282Z sparse-checkout-cone-mode: true 2025-12-04T08:56:17.5276572Z fetch-tags: false 2025-12-04T08:56:17.5276757Z lfs: false 2025-12-04T08:56:17.5276928Z set-safe-directory: true 2025-12-04T08:56:17.5277129Z env: 2025-12-04T08:56:17.5277285Z GIT_DEFAULT_BRANCH: main 2025-12-04T08:56:17.5277470Z ##[endgroup] 2025-12-04T08:56:17.6289350Z Syncing repository: pytorch/pytorch 2025-12-04T08:56:17.6290523Z ##[group]Getting Git version info 2025-12-04T08:56:17.6290917Z Working directory is '/home/ec2-user/actions-runner/_work/pytorch/pytorch' 2025-12-04T08:56:17.6291437Z [command]/usr/bin/git version 2025-12-04T08:56:17.6291642Z git version 2.50.1 2025-12-04T08:56:17.6301710Z ##[endgroup] 2025-12-04T08:56:17.6310982Z Copying '/home/ec2-user/.gitconfig' to '/home/ec2-user/actions-runner/_work/_temp/78fdb1a0-6dc0-4d71-b7f8-5bef8e4ce858/.gitconfig' 2025-12-04T08:56:17.6328355Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/78fdb1a0-6dc0-4d71-b7f8-5bef8e4ce858' before making global git config changes 2025-12-04T08:56:17.6329264Z Adding repository directory to the temporary git global config as a safe directory 2025-12-04T08:56:17.6333837Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/pytorch/pytorch 2025-12-04T08:56:17.6369378Z Deleting the contents of '/home/ec2-user/actions-runner/_work/pytorch/pytorch' 2025-12-04T08:56:17.6379026Z ##[group]Initializing the repository 2025-12-04T08:56:17.6382580Z [command]/usr/bin/git init /home/ec2-user/actions-runner/_work/pytorch/pytorch 2025-12-04T08:56:17.6420638Z hint: Using 'master' as the name for the initial branch. This default branch name 2025-12-04T08:56:17.6421195Z hint: is subject to change. To configure the initial branch name to use in all 2025-12-04T08:56:17.6421691Z hint: of your new repositories, which will suppress this warning, call: 2025-12-04T08:56:17.6422059Z hint: 2025-12-04T08:56:17.6422349Z hint: git config --global init.defaultBranch 2025-12-04T08:56:17.6422672Z hint: 2025-12-04T08:56:17.6422972Z hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and 2025-12-04T08:56:17.6423502Z hint: 'development'. The just-created branch can be renamed via this command: 2025-12-04T08:56:17.6423883Z hint: 2025-12-04T08:56:17.6424087Z hint: git branch -m 2025-12-04T08:56:17.6424334Z hint: 2025-12-04T08:56:17.6424686Z hint: Disable this message with "git config set advice.defaultBranchName false" 2025-12-04T08:56:17.6425307Z Initialized empty Git repository in /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/ 2025-12-04T08:56:17.6430625Z [command]/usr/bin/git remote add origin https://github.com/pytorch/pytorch 2025-12-04T08:56:17.6458657Z ##[endgroup] 2025-12-04T08:56:17.6459061Z ##[group]Disabling automatic garbage collection 2025-12-04T08:56:17.6462278Z [command]/usr/bin/git config --local gc.auto 0 2025-12-04T08:56:17.6488505Z ##[endgroup] 2025-12-04T08:56:17.6488847Z ##[group]Setting up auth 2025-12-04T08:56:17.6494371Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand 2025-12-04T08:56:17.6520983Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :" 2025-12-04T08:56:17.6865097Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2025-12-04T08:56:17.6891366Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :" 2025-12-04T08:56:17.7211761Z [command]/usr/bin/git config --local --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:56:17.7239938Z [command]/usr/bin/git submodule foreach --recursive git config --local --show-origin --name-only --get-regexp remote.origin.url 2025-12-04T08:56:17.7549967Z [command]/usr/bin/git config --local http.https://github.com/.extraheader AUTHORIZATION: basic *** 2025-12-04T08:56:17.7590924Z ##[endgroup] 2025-12-04T08:56:17.7591278Z ##[group]Fetching the repository 2025-12-04T08:56:17.7598066Z [command]/usr/bin/git -c protocol.version=2 fetch --prune --no-recurse-submodules origin +refs/heads/*:refs/remotes/origin/* +refs/tags/*:refs/tags/* 2025-12-04T08:56:58.8733608Z From https://github.com/pytorch/pytorch 2025-12-04T08:56:58.8734265Z * [new branch] 2.6.0.dev20241004+ -> origin/2.6.0.dev20241004+ 2025-12-04T08:56:58.8734728Z * [new branch] 2.9.1 -> origin/2.9.1 2025-12-04T08:56:58.8735235Z * [new branch] AaronWang04_addmmfusion_perftest -> origin/AaronWang04_addmmfusion_perftest 2025-12-04T08:56:58.8735818Z * [new branch] Flamefire-patch-1 -> origin/Flamefire-patch-1 2025-12-04T08:56:58.8736735Z * [new branch] HDCharles-2.6.0-release-notes -> origin/HDCharles-2.6.0-release-notes 2025-12-04T08:56:58.8738455Z * [new branch] HOPrintFunc -> origin/HOPrintFunc 2025-12-04T08:56:58.8741287Z * [new branch] IvanKobzarev/stack/1 -> origin/IvanKobzarev/stack/1 2025-12-04T08:56:58.8743882Z * [new branch] NicoshevSVE128 -> origin/NicoshevSVE128 2025-12-04T08:56:58.8744996Z * [new branch] PR-AOTInductorNoneBug -> origin/PR-AOTInductorNoneBug 2025-12-04T08:56:58.8746928Z * [new branch] PR-AOTInductorNoneBugFix -> origin/PR-AOTInductorNoneBugFix 2025-12-04T08:56:58.8748392Z * [new branch] PR-FixConfigsIssue -> origin/PR-FixConfigsIssue 2025-12-04T08:56:58.8749967Z * [new branch] PR-NoneBugFix-viable -> origin/PR-NoneBugFix-viable 2025-12-04T08:56:58.8751728Z * [new branch] PR-ResetToZero -> origin/PR-ResetToZero 2025-12-04T08:56:58.8753315Z * [new branch] Update-Flash-Packaging -> origin/Update-Flash-Packaging 2025-12-04T08:56:58.8754905Z * [new branch] VLA_exp -> origin/VLA_exp 2025-12-04T08:56:58.8756842Z * [new branch] activation_bench -> origin/activation_bench 2025-12-04T08:56:58.8759118Z * [new branch] addmm-heuristic -> origin/addmm-heuristic 2025-12-04T08:56:58.8761298Z * [new branch] adi/onednn_aarch64 -> origin/adi/onednn_aarch64 2025-12-04T08:56:58.8763249Z * [new branch] adi/test -> origin/adi/test 2025-12-04T08:56:58.8764789Z * [new branch] adi/test_bgemm -> origin/adi/test_bgemm 2025-12-04T08:56:58.8766369Z * [new branch] adi/test_m8g -> origin/adi/test_m8g 2025-12-04T08:56:58.8767983Z * [new branch] adi/test_onednn -> origin/adi/test_onednn 2025-12-04T08:56:58.8769604Z * [new branch] adi/test_onednn_v3.9 -> origin/adi/test_onednn_v3.9 2025-12-04T08:56:58.8771225Z * [new branch] adi/test_presve_change -> origin/adi/test_presve_change 2025-12-04T08:56:58.8772836Z * [new branch] adi/test_timm -> origin/adi/test_timm 2025-12-04T08:56:58.8774832Z * [new branch] adi/testpresve_change -> origin/adi/testpresve_change 2025-12-04T08:56:58.8777645Z * [new branch] aditew01/test/vec_bf16 -> origin/aditew01/test/vec_bf16 2025-12-04T08:56:58.8779248Z * [new branch] ah-globalfeedback-hook -> origin/ah-globalfeedback-hook 2025-12-04T08:56:58.8781160Z * [new branch] albanD-patch-1 -> origin/albanD-patch-1 2025-12-04T08:56:58.8782622Z * [new branch] also-surround-shimh -> origin/also-surround-shimh 2025-12-04T08:56:58.8785107Z * [new branch] angelayi/aot_compile -> origin/angelayi/aot_compile 2025-12-04T08:56:58.8786472Z * [new branch] angelayi/aoti_additional_files -> origin/angelayi/aoti_additional_files 2025-12-04T08:56:58.8788062Z * [new branch] angelayi/benchmark -> origin/angelayi/benchmark 2025-12-04T08:56:58.8789750Z * [new branch] angelayi/change_pytree_serialization -> origin/angelayi/change_pytree_serialization 2025-12-04T08:56:58.8791100Z * [new branch] angelayi/cpp_loader -> origin/angelayi/cpp_loader 2025-12-04T08:56:58.8792844Z * [new branch] angelayi/inductor_const -> origin/angelayi/inductor_const 2025-12-04T08:56:58.8794340Z * [new branch] angelayi/lstm -> origin/angelayi/lstm 2025-12-04T08:56:58.8796379Z * [new branch] angelayi/no_so_weight -> origin/angelayi/no_so_weight 2025-12-04T08:56:58.8798308Z * [new branch] angelayi/scan_layers -> origin/angelayi/scan_layers 2025-12-04T08:56:58.8799942Z * [new branch] angelayi/side_eff -> origin/angelayi/side_eff 2025-12-04T08:56:58.8801558Z * [new branch] angelayi/state_dict -> origin/angelayi/state_dict 2025-12-04T08:56:58.8803248Z * [new branch] angelayi/symint_input -> origin/angelayi/symint_input 2025-12-04T08:56:58.8805120Z * [new branch] angelayi/symm_mem -> origin/angelayi/symm_mem 2025-12-04T08:56:58.8806509Z * [new branch] angelayi/test_cpp -> origin/angelayi/test_cpp 2025-12-04T08:56:58.8808271Z * [new branch] angelayi/torch_size -> origin/angelayi/torch_size 2025-12-04T08:56:58.8809822Z * [new branch] annotate_assert -> origin/annotate_assert 2025-12-04T08:56:58.8811510Z * [new branch] annotate_fallback_kernel -> origin/annotate_fallback_kernel 2025-12-04T08:56:58.8813285Z * [new branch] annotation_deepcopy -> origin/annotation_deepcopy 2025-12-04T08:56:58.8814877Z * [new branch] annotation_dynamo -> origin/annotation_dynamo 2025-12-04T08:56:58.8816484Z * [new branch] aot_eager_stack_trace -> origin/aot_eager_stack_trace 2025-12-04T08:56:58.8818144Z * [new branch] aoti-cuda-alloc -> origin/aoti-cuda-alloc 2025-12-04T08:56:58.8819728Z * [new branch] aoti_const_device -> origin/aoti_const_device 2025-12-04T08:56:58.8821376Z * [new branch] aoti_fqn_name_interface -> origin/aoti_fqn_name_interface 2025-12-04T08:56:58.8822941Z * [new branch] aoti_package_weights_binary -> origin/aoti_package_weights_binary 2025-12-04T08:56:58.8824529Z * [new branch] aoti_target_windows -> origin/aoti_target_windows 2025-12-04T08:56:58.8827276Z * [new branch] arsh/feat/inductor_check_profiling -> origin/arsh/feat/inductor_check_profiling 2025-12-04T08:56:58.8828944Z * [new branch] async_tp -> origin/async_tp 2025-12-04T08:56:58.8830642Z * [new branch] atalman-inductor-perf-cu124 -> origin/atalman-inductor-perf-cu124 2025-12-04T08:56:58.8832305Z * [new branch] atalman-inductor-perf-cu124.1 -> origin/atalman-inductor-perf-cu124.1 2025-12-04T08:56:58.8833927Z * [new branch] atalman-patch-2 -> origin/atalman-patch-2 2025-12-04T08:56:58.8835664Z * [new branch] atalman-patch-3 -> origin/atalman-patch-3 2025-12-04T08:56:58.8837308Z * [new branch] atalman-patch-4 -> origin/atalman-patch-4 2025-12-04T08:56:58.8838990Z * [new branch] atalman-patch-5 -> origin/atalman-patch-5 2025-12-04T08:56:58.8840674Z * [new branch] atalman-patch-6 -> origin/atalman-patch-6 2025-12-04T08:56:58.8842294Z * [new branch] atalman-patch-7 -> origin/atalman-patch-7 2025-12-04T08:56:58.8844011Z * [new branch] atalman-patch-8 -> origin/atalman-patch-8 2025-12-04T08:56:58.8845678Z * [new branch] atalman_inductor_2.3.1 -> origin/atalman_inductor_2.3.1 2025-12-04T08:56:58.8847308Z * [new branch] atalman_inductor_2.4.0 -> origin/atalman_inductor_2.4.0 2025-12-04T08:56:58.8848969Z * [new branch] atalman_inductor_2.4.x -> origin/atalman_inductor_2.4.x 2025-12-04T08:56:58.8850686Z * [new branch] attention_benchmarking_clean -> origin/attention_benchmarking_clean 2025-12-04T08:56:58.8852951Z * [new branch] bahuang/dt_fix_scalar_add -> origin/bahuang/dt_fix_scalar_add 2025-12-04T08:56:58.8854473Z * [new branch] bahuang/fix_debug_mode -> origin/bahuang/fix_debug_mode 2025-12-04T08:56:58.8856032Z * [new branch] bahuang/fix_expand -> origin/bahuang/fix_expand 2025-12-04T08:56:58.8857608Z * [new branch] bahuang/test -> origin/bahuang/test 2025-12-04T08:56:58.8859942Z * [new branch] base/1.5 -> origin/base/1.5 2025-12-04T08:56:58.8861969Z * [new branch] batching_sdpa_efficient_attention -> origin/batching_sdpa_efficient_attention 2025-12-04T08:56:58.8863856Z * [new branch] bench_scaled_mm_ops -> origin/bench_scaled_mm_ops 2025-12-04T08:56:58.8865618Z * [new branch] benchmark-updates -> origin/benchmark-updates 2025-12-04T08:56:58.8867024Z * [new branch] benchmarking-script -> origin/benchmarking-script 2025-12-04T08:56:58.8869258Z * [new branch] bertmaher/pinbump26 -> origin/bertmaher/pinbump26 2025-12-04T08:56:58.8871409Z * [new branch] bertrand/cutlass -> origin/bertrand/cutlass 2025-12-04T08:56:58.8873624Z * [new branch] bf/bug-static-input -> origin/bf/bug-static-input 2025-12-04T08:56:58.8875148Z * [new branch] bf/cg-backend -> origin/bf/cg-backend 2025-12-04T08:56:58.8876690Z * [new branch] bf/cg-nccl-test -> origin/bf/cg-nccl-test 2025-12-04T08:56:58.8878166Z * [new branch] bf/cg-remove-check -> origin/bf/cg-remove-check 2025-12-04T08:56:58.8879793Z * [new branch] bf/clean-torchbench-hf -> origin/bf/clean-torchbench-hf 2025-12-04T08:56:58.8881293Z * [new branch] bf/combo-debug-log -> origin/bf/combo-debug-log 2025-12-04T08:56:58.8882869Z * [new branch] bf/cudagraph -> origin/bf/cudagraph 2025-12-04T08:56:58.8884987Z * [new branch] bf/cudagraph-disable-input-mutation -> origin/bf/cudagraph-disable-input-mutation 2025-12-04T08:56:58.8886878Z * [new branch] bf/cudagraph-enable-input-mutation-support-benchmark -> origin/bf/cudagraph-enable-input-mutation-support-benchmark 2025-12-04T08:56:58.8888071Z * [new branch] bf/cudagraph-partition -> origin/bf/cudagraph-partition 2025-12-04T08:56:58.8889830Z * [new branch] bf/donated-buffer-bench -> origin/bf/donated-buffer-bench 2025-12-04T08:56:58.8891506Z * [new branch] bf/dynamo-partition -> origin/bf/dynamo-partition 2025-12-04T08:56:58.8893196Z * [new branch] bf/lite -> origin/bf/lite 2025-12-04T08:56:58.8894851Z * [new branch] bf/pa-non-divisible -> origin/bf/pa-non-divisible 2025-12-04T08:56:58.8896555Z * [new branch] bf/partition-cache-free-symbols -> origin/bf/partition-cache-free-symbols 2025-12-04T08:56:58.8898276Z * [new branch] bf/partition-memory-plan -> origin/bf/partition-memory-plan 2025-12-04T08:56:58.8899910Z * [new branch] bf/partition-move-cpu -> origin/bf/partition-move-cpu 2025-12-04T08:56:58.8901615Z * [new branch] bf/partition-view-fallback -> origin/bf/partition-view-fallback 2025-12-04T08:56:58.8903192Z * [new branch] bf/remove-check-55b0c39d -> origin/bf/remove-check-55b0c39d 2025-12-04T08:56:58.8904773Z * [new branch] bf/timm-nov-26-2025 -> origin/bf/timm-nov-26-2025 2025-12-04T08:56:58.8906463Z * [new branch] bf/transformer-pin-4-57-3 -> origin/bf/transformer-pin-4-57-3 2025-12-04T08:56:58.8908119Z * [new branch] bisect_perf_hf_T5_3acc6eac492 -> origin/bisect_perf_hf_T5_3acc6eac492 2025-12-04T08:56:58.8909712Z * [new branch] bisect_perf_hf_T5_3fcf66f61fb -> origin/bisect_perf_hf_T5_3fcf66f61fb 2025-12-04T08:56:58.8911316Z * [new branch] bisect_perf_hf_T5_4009d154129 -> origin/bisect_perf_hf_T5_4009d154129 2025-12-04T08:56:58.8912838Z * [new branch] bisect_perf_hf_T5_40d0740e73d -> origin/bisect_perf_hf_T5_40d0740e73d 2025-12-04T08:56:58.8914400Z * [new branch] bisect_perf_hf_T5_5268754e -> origin/bisect_perf_hf_T5_5268754e 2025-12-04T08:56:58.8915955Z * [new branch] bisect_perf_hf_T5_7d89a8d385c -> origin/bisect_perf_hf_T5_7d89a8d385c 2025-12-04T08:56:58.8917534Z * [new branch] bisect_perf_hf_T5_b7a25c1ee7c -> origin/bisect_perf_hf_T5_b7a25c1ee7c 2025-12-04T08:56:58.8919065Z * [new branch] bisect_perf_hf_T5_c25b201583f -> origin/bisect_perf_hf_T5_c25b201583f 2025-12-04T08:56:58.8920680Z * [new branch] bisect_perf_hf_T5_c93e57efac0 -> origin/bisect_perf_hf_T5_c93e57efac0 2025-12-04T08:56:58.8922542Z * [new branch] bisect_perf_hf_T5_ca9813ea149 -> origin/bisect_perf_hf_T5_ca9813ea149 2025-12-04T08:56:58.8923793Z * [new branch] bisect_perf_hf_T5_d65f194a -> origin/bisect_perf_hf_T5_d65f194a 2025-12-04T08:56:58.8925493Z * [new branch] bisect_perf_hf_T5_da94ab0b -> origin/bisect_perf_hf_T5_da94ab0b 2025-12-04T08:56:58.8927062Z * [new branch] bisect_perf_hf_T5_da94ab0b_new -> origin/bisect_perf_hf_T5_da94ab0b_new 2025-12-04T08:56:58.8928674Z * [new branch] bisect_perf_hf_T5_db4e8a1d8a8 -> origin/bisect_perf_hf_T5_db4e8a1d8a8 2025-12-04T08:56:58.8930212Z * [new branch] bisect_perf_hf_T5_e0d97e936a2 -> origin/bisect_perf_hf_T5_e0d97e936a2 2025-12-04T08:56:58.8931748Z * [new branch] bisect_perf_hf_T5_f23621ec563 -> origin/bisect_perf_hf_T5_f23621ec563 2025-12-04T08:56:58.8934539Z * [new branch] brister/fx_device_type -> origin/brister/fx_device_type 2025-12-04T08:56:58.8936042Z * [new branch] brister/test_inductor_all_fx -> origin/brister/test_inductor_all_fx 2025-12-04T08:56:58.8937613Z * [new branch] brister/tiled_reduction_no_numel_check -> origin/brister/tiled_reduction_no_numel_check 2025-12-04T08:56:58.8939143Z * [new branch] bwd-backup -> origin/bwd-backup 2025-12-04T08:56:58.8940871Z * [new branch] c57382a49 -> origin/c57382a49 2025-12-04T08:56:58.8942428Z * [new branch] ca_0431d47eaa -> origin/ca_0431d47eaa 2025-12-04T08:56:58.8943934Z * [new branch] ca_fix_0431d47eaa -> origin/ca_fix_0431d47eaa 2025-12-04T08:56:58.8946163Z * [new branch] camyllh/test_setup_hooks_push -> origin/camyllh/test_setup_hooks_push 2025-12-04T08:56:58.8947821Z * [new branch] cccclai-patch-1 -> origin/cccclai-patch-1 2025-12-04T08:56:58.8949561Z * [new branch] cherry-pick-159969-by-pytorch_bot_bot_ -> origin/cherry-pick-159969-by-pytorch_bot_bot_ 2025-12-04T08:56:58.8951207Z * [new branch] cherry-pick-160586-by-pytorch_bot_bot_ -> origin/cherry-pick-160586-by-pytorch_bot_bot_ 2025-12-04T08:56:58.8952938Z * [new branch] cherry-pick-162208-by-pytorch_bot_bot_ -> origin/cherry-pick-162208-by-pytorch_bot_bot_ 2025-12-04T08:56:58.8954527Z * [new branch] cherry-pick-163169-by-pytorch_bot_bot_ -> origin/cherry-pick-163169-by-pytorch_bot_bot_ 2025-12-04T08:56:58.8956185Z * [new branch] cherry-pick-165086-by-pytorch_bot_bot_ -> origin/cherry-pick-165086-by-pytorch_bot_bot_ 2025-12-04T08:56:58.8957920Z * [new branch] cherry-pick-165514-by-pytorch_bot_bot_ -> origin/cherry-pick-165514-by-pytorch_bot_bot_ 2025-12-04T08:56:58.8959527Z * [new branch] cherry-pick-165601-by-pytorch_bot_bot_ -> origin/cherry-pick-165601-by-pytorch_bot_bot_ 2025-12-04T08:56:58.8961168Z * [new branch] cherry-pick-165667-by-pytorch_bot_bot_ -> origin/cherry-pick-165667-by-pytorch_bot_bot_ 2025-12-04T08:56:58.8963265Z * [new branch] cherry-pick-165815-by-pytorch_bot_bot_ -> origin/cherry-pick-165815-by-pytorch_bot_bot_ 2025-12-04T08:56:58.8964926Z * [new branch] cherry-pick-165922-by-pytorch_bot_bot_ -> origin/cherry-pick-165922-by-pytorch_bot_bot_ 2025-12-04T08:56:58.8966593Z * [new branch] cherry-pick-166148-by-pytorch_bot_bot_ -> origin/cherry-pick-166148-by-pytorch_bot_bot_ 2025-12-04T08:56:58.8968151Z * [new branch] cherry-pick-166181-by-pytorch_bot_bot_ -> origin/cherry-pick-166181-by-pytorch_bot_bot_ 2025-12-04T08:56:58.8970252Z * [new branch] cherry-pick-166404-by-pytorch_bot_bot_ -> origin/cherry-pick-166404-by-pytorch_bot_bot_ 2025-12-04T08:56:58.8971931Z * [new branch] cherry-pick-166427-by-pytorch_bot_bot_ -> origin/cherry-pick-166427-by-pytorch_bot_bot_ 2025-12-04T08:56:58.8973894Z * [new branch] cherry-pick-166480-by-pytorch_bot_bot_ -> origin/cherry-pick-166480-by-pytorch_bot_bot_ 2025-12-04T08:56:58.8975216Z * [new branch] cherry-pick-166570-by-pytorch_bot_bot_ -> origin/cherry-pick-166570-by-pytorch_bot_bot_ 2025-12-04T08:56:58.8976994Z * [new branch] cherry-pick-166993-by-pytorch_bot_bot_ -> origin/cherry-pick-166993-by-pytorch_bot_bot_ 2025-12-04T08:56:58.8978578Z * [new branch] cherry-pick-167111-by-pytorch_bot_bot_ -> origin/cherry-pick-167111-by-pytorch_bot_bot_ 2025-12-04T08:56:58.8980331Z * [new branch] cherry-pick-167478-by-pytorch_bot_bot_ -> origin/cherry-pick-167478-by-pytorch_bot_bot_ 2025-12-04T08:56:58.8981834Z * [new branch] cherry_pick_166036_166040 -> origin/cherry_pick_166036_166040 2025-12-04T08:56:58.8983410Z * [new branch] cherry_pick_166457 -> origin/cherry_pick_166457 2025-12-04T08:56:58.8985099Z * [new branch] cherrypick_166338 -> origin/cherrypick_166338 2025-12-04T08:56:58.8986746Z * [new branch] cherrypick_166458 -> origin/cherrypick_166458 2025-12-04T08:56:58.8988328Z * [new branch] cherrypick_166586 -> origin/cherrypick_166586 2025-12-04T08:56:58.8989952Z * [new branch] cherrypick_166956 -> origin/cherrypick_166956 2025-12-04T08:56:58.8991551Z * [new branch] ci_attn -> origin/ci_attn 2025-12-04T08:56:58.8993242Z * [new branch] codex-testing -> origin/codex-testing 2025-12-04T08:56:58.8995650Z * [new branch] codex/add-check_memory_overlap-helper-functions -> origin/codex/add-check_memory_overlap-helper-functions 2025-12-04T08:56:58.8996960Z * [new branch] codex/fix-issue-121219-in-pytorch -> origin/codex/fix-issue-121219-in-pytorch 2025-12-04T08:56:58.8999052Z * [new branch] codex/investigate-segfaults-in-get_tensor_storage_id -> origin/codex/investigate-segfaults-in-get_tensor_storage_id 2025-12-04T08:56:58.9001146Z * [new branch] codex/refactor-lintrunner-config-to-use-uv-run -> origin/codex/refactor-lintrunner-config-to-use-uv-run 2025-12-04T08:56:58.9002431Z * [new branch] compatiblpy39util -> origin/compatiblpy39util 2025-12-04T08:56:58.9004103Z * [new branch] cond_hop_device -> origin/cond_hop_device 2025-12-04T08:56:58.9005708Z * [new branch] context_test -> origin/context_test 2025-12-04T08:56:58.9007926Z * [new branch] copilot/code-style-cleanup-python-pip -> origin/copilot/code-style-cleanup-python-pip 2025-12-04T08:56:58.9010007Z * [new branch] cpio/fix_new_ami_tests -> origin/cpio/fix_new_ami_tests 2025-12-04T08:56:58.9011666Z * [new branch] cpp-docs-dependency-upgrade -> origin/cpp-docs-dependency-upgrade 2025-12-04T08:56:58.9014011Z * [new branch] csl/always_produce_xml -> origin/csl/always_produce_xml 2025-12-04T08:56:58.9015454Z * [new branch] csl/build_test_more_procs -> origin/csl/build_test_more_procs 2025-12-04T08:56:58.9017015Z * [new branch] csl/build_test_more_procs2 -> origin/csl/build_test_more_procs2 2025-12-04T08:56:58.9018577Z * [new branch] csl/clean_up -> origin/csl/clean_up 2025-12-04T08:56:58.9020135Z * [new branch] csl/fix_retry_segfault_exit -> origin/csl/fix_retry_segfault_exit 2025-12-04T08:56:58.9021637Z * [new branch] csl/katex -> origin/csl/katex 2025-12-04T08:56:58.9023418Z * [new branch] csl/larger_runner -> origin/csl/larger_runner 2025-12-04T08:56:58.9025417Z * [new branch] csl/lint_testing -> origin/csl/lint_testing 2025-12-04T08:56:58.9027327Z * [new branch] csl/lint_thing -> origin/csl/lint_thing 2025-12-04T08:56:58.9028956Z * [new branch] csl/lintrunner_stuff -> origin/csl/lintrunner_stuff 2025-12-04T08:56:58.9030651Z * [new branch] csl/manually_gen_json -> origin/csl/manually_gen_json 2025-12-04T08:56:58.9032313Z * [new branch] csl/mps_sharding -> origin/csl/mps_sharding 2025-12-04T08:56:58.9033911Z * [new branch] csl/multistage_docker -> origin/csl/multistage_docker 2025-12-04T08:56:58.9035550Z * [new branch] csl/print_timing -> origin/csl/print_timing 2025-12-04T08:56:58.9037159Z * [new branch] csl/remove_experiment -> origin/csl/remove_experiment 2025-12-04T08:56:58.9038827Z * [new branch] csl/remove_maybe_unused_var -> origin/csl/remove_maybe_unused_var 2025-12-04T08:56:58.9040505Z * [new branch] csl/remove_repo_specific_autolabel -> origin/csl/remove_repo_specific_autolabel 2025-12-04T08:56:58.9042239Z * [new branch] csl/remove_run_parallel -> origin/csl/remove_run_parallel 2025-12-04T08:56:58.9043749Z * [new branch] csl/remove_unused_vars -> origin/csl/remove_unused_vars 2025-12-04T08:56:58.9045411Z * [new branch] csl/revert_open -> origin/csl/revert_open 2025-12-04T08:56:58.9047008Z * [new branch] csl/skip_build -> origin/csl/skip_build 2025-12-04T08:56:58.9048638Z * [new branch] csl/smaller_avx_amx_runenrs -> origin/csl/smaller_avx_amx_runenrs 2025-12-04T08:56:58.9050171Z * [new branch] csl/td_job_level -> origin/csl/td_job_level 2025-12-04T08:56:58.9051819Z * [new branch] csl/test_cuda_build_large_runner -> origin/csl/test_cuda_build_large_runner 2025-12-04T08:56:58.9053759Z * [new branch] csl/test_owners_autograd_dispatch_nn -> origin/csl/test_owners_autograd_dispatch_nn 2025-12-04T08:56:58.9055373Z * [new branch] csl/test_owners_higher_confidence -> origin/csl/test_owners_higher_confidence 2025-12-04T08:56:58.9057449Z * [new branch] csl/upload_json_running -> origin/csl/upload_json_running 2025-12-04T08:56:58.9059061Z * [new branch] csl/win_sccache -> origin/csl/win_sccache 2025-12-04T08:56:58.9060642Z * [new branch] csl/xml_stuff -> origin/csl/xml_stuff 2025-12-04T08:56:58.9062522Z * [new branch] cublasrelax2 -> origin/cublasrelax2 2025-12-04T08:56:58.9064504Z * [new branch] cuda_mempool -> origin/cuda_mempool 2025-12-04T08:56:58.9066126Z * [new branch] custom_lowering_dict -> origin/custom_lowering_dict 2025-12-04T08:56:58.9068280Z * [new branch] d4l3k/debug_plane_frtrace -> origin/d4l3k/debug_plane_frtrace 2025-12-04T08:56:58.9070471Z * [new branch] daxia6/2.8o3 -> origin/daxia6/2.8o3 2025-12-04T08:56:58.9072030Z * [new branch] debug-guard -> origin/debug-guard 2025-12-04T08:56:58.9073844Z * [new branch] delete-quant-docs -> origin/delete-quant-docs 2025-12-04T08:56:58.9078766Z * [new branch] dependabot/pip/dot-ci/docker/ci_commit_pins/main/transformers-4.57.0 -> origin/dependabot/pip/dot-ci/docker/ci_commit_pins/main/transformers-4.57.0 2025-12-04T08:56:58.9080375Z * [new branch] dependabot/pip/dot-ci/docker/ci_commit_pins/main/transformers-4.57.1 -> origin/dependabot/pip/dot-ci/docker/ci_commit_pins/main/transformers-4.57.1 2025-12-04T08:56:58.9082407Z * [new branch] desertfire/test_cpp_wrapper -> origin/desertfire/test_cpp_wrapper 2025-12-04T08:56:58.9084037Z * [new branch] desertfire/triton-cpu-for-aarch64 -> origin/desertfire/triton-cpu-for-aarch64 2025-12-04T08:56:58.9086597Z * [new branch] dev/dhruva/flex_attn_opt -> origin/dev/dhruva/flex_attn_opt 2025-12-04T08:56:58.9088940Z * [new branch] dev/joona/MPSNDArrayAdd -> origin/dev/joona/MPSNDArrayAdd 2025-12-04T08:56:58.9090629Z * [new branch] dev/joona/Unranked -> origin/dev/joona/Unranked 2025-12-04T08:56:58.9092750Z * [new branch] dev/joona/cat -> origin/dev/joona/cat 2025-12-04T08:56:58.9094112Z * [new branch] dev/joona/embeddingbag -> origin/dev/joona/embeddingbag 2025-12-04T08:56:58.9095936Z * [new branch] dev/joona/fix_sdpa_memtest -> origin/dev/joona/fix_sdpa_memtest 2025-12-04T08:56:58.9097645Z * [new branch] dev/joona/getTensorsString -> origin/dev/joona/getTensorsString 2025-12-04T08:56:58.9099471Z * [new branch] dev/joona/mps_linear_macos14 -> origin/dev/joona/mps_linear_macos14 2025-12-04T08:56:58.9101529Z * [new branch] dev/joona/scalar_clamp -> origin/dev/joona/scalar_clamp 2025-12-04T08:56:58.9103560Z * [new branch] dev/joona/sdpa -> origin/dev/joona/sdpa 2025-12-04T08:56:58.9105778Z * [new branch] dev/joona/sdpa_api -> origin/dev/joona/sdpa_api 2025-12-04T08:56:58.9107624Z * [new branch] dev/joona/type_inf -> origin/dev/joona/type_inf 2025-12-04T08:56:58.9109428Z * [new branch] dev/joona/ulpAssertClose -> origin/dev/joona/ulpAssertClose 2025-12-04T08:56:58.9111213Z * [new branch] dev/joona/upsize3d -> origin/dev/joona/upsize3d 2025-12-04T08:56:58.9112761Z * [new branch] disp_counter -> origin/disp_counter 2025-12-04T08:56:58.9114426Z * [new branch] divyanshk-patch-1 -> origin/divyanshk-patch-1 2025-12-04T08:56:58.9115959Z * [new branch] docs -> origin/docs 2025-12-04T08:56:58.9117827Z * [new branch] documentation -> origin/documentation 2025-12-04T08:56:58.9119414Z * [new branch] eager_model_benchmarks -> origin/eager_model_benchmarks 2025-12-04T08:56:58.9121640Z * [new branch] embg/test_inductor_ci_control -> origin/embg/test_inductor_ci_control 2025-12-04T08:56:58.9123128Z * [new branch] embg/triton_l2_prefetch_128B -> origin/embg/triton_l2_prefetch_128B 2025-12-04T08:56:58.9124554Z * [new branch] embg/triton_l2_prefetch_256B -> origin/embg/triton_l2_prefetch_256B 2025-12-04T08:56:58.9126196Z * [new branch] eqy-patch-1 -> origin/eqy-patch-1 2025-12-04T08:56:58.9127871Z * [new branch] eqy-patch-2 -> origin/eqy-patch-2 2025-12-04T08:56:58.9129500Z * [new branch] eqy-patch-3 -> origin/eqy-patch-3 2025-12-04T08:56:58.9131165Z * [new branch] eqy-patch-4 -> origin/eqy-patch-4 2025-12-04T08:56:58.9132949Z * [new branch] eqy-patch-5 -> origin/eqy-patch-5 2025-12-04T08:56:58.9134490Z * [new branch] eqy-patch-6 -> origin/eqy-patch-6 2025-12-04T08:56:58.9136607Z * [new branch] exclamaforte/amd-ma -> origin/exclamaforte/amd-ma 2025-12-04T08:56:58.9138290Z * [new branch] exclamaforte/combo-kernels-perf-run -> origin/exclamaforte/combo-kernels-perf-run 2025-12-04T08:56:58.9139778Z * [new branch] exclamaforte/do_bench_refactor -> origin/exclamaforte/do_bench_refactor 2025-12-04T08:56:58.9141350Z * [new branch] exclamaforte/enable-mem-dep-fusion -> origin/exclamaforte/enable-mem-dep-fusion 2025-12-04T08:56:58.9142956Z * [new branch] exclamaforte/fix-exhaustive-autotuning -> origin/exclamaforte/fix-exhaustive-autotuning 2025-12-04T08:56:58.9144843Z * [new branch] exclamaforte/fix-trace-parsing-fx-svg -> origin/exclamaforte/fix-trace-parsing-fx-svg 2025-12-04T08:56:58.9146793Z * [new branch] exclamaforte/force-pointwise-cat-perf-run -> origin/exclamaforte/force-pointwise-cat-perf-run 2025-12-04T08:56:58.9148307Z * [new branch] exclamaforte/fusion-data -> origin/exclamaforte/fusion-data 2025-12-04T08:56:58.9150008Z * [new branch] exclamaforte/gemm-benchmark-run -> origin/exclamaforte/gemm-benchmark-run 2025-12-04T08:56:58.9151682Z * [new branch] exclamaforte/gemm-export-model -> origin/exclamaforte/gemm-export-model 2025-12-04T08:56:58.9153115Z * [new branch] exclamaforte/gemm-model -> origin/exclamaforte/gemm-model 2025-12-04T08:56:58.9154936Z * [new branch] exclamaforte/gemm-model-all-data-collection -> origin/exclamaforte/gemm-model-all-data-collection 2025-12-04T08:56:58.9156387Z * [new branch] exclamaforte/gemm-to-amd -> origin/exclamaforte/gemm-to-amd 2025-12-04T08:56:58.9157964Z * [new branch] exclamaforte/just-gemm-model -> origin/exclamaforte/just-gemm-model 2025-12-04T08:56:58.9159698Z * [new branch] exclamaforte/just-gemm-model-no-refactor -> origin/exclamaforte/just-gemm-model-no-refactor 2025-12-04T08:56:58.9161282Z * [new branch] exclamaforte/profile-diff-algo -> origin/exclamaforte/profile-diff-algo 2025-12-04T08:56:58.9163229Z * [new branch] exclamaforte/profiler-visualization -> origin/exclamaforte/profiler-visualization 2025-12-04T08:56:58.9165074Z * [new branch] exclamaforte/test_cpp_wrapper_mode -> origin/exclamaforte/test_cpp_wrapper_mode 2025-12-04T08:56:58.9166768Z * [new branch] exclamaforte/update-autotune-configs -> origin/exclamaforte/update-autotune-configs 2025-12-04T08:56:58.9168351Z * [new branch] exclamaforte/update-autotune-configs-2 -> origin/exclamaforte/update-autotune-configs-2 2025-12-04T08:56:58.9169878Z * [new branch] exec -> origin/exec 2025-12-04T08:56:58.9171695Z * [new branch] experimental-mosaic -> origin/experimental-mosaic 2025-12-04T08:56:58.9173532Z * [new branch] export-D61047529 -> origin/export-D61047529 2025-12-04T08:56:58.9175107Z * [new branch] export-D71412006 -> origin/export-D71412006 2025-12-04T08:56:58.9176874Z * [new branch] export-D73042989 -> origin/export-D73042989 2025-12-04T08:56:58.9178405Z * [new branch] export-D78957093 -> origin/export-D78957093 2025-12-04T08:56:58.9179968Z * [new branch] export-D78996107 -> origin/export-D78996107 2025-12-04T08:56:58.9181558Z * [new branch] export-D80823877 -> origin/export-D80823877 2025-12-04T08:56:58.9183235Z * [new branch] export-D80958642 -> origin/export-D80958642 2025-12-04T08:56:58.9184819Z * [new branch] export-D81054193 -> origin/export-D81054193 2025-12-04T08:56:58.9186352Z * [new branch] export-D81204584 -> origin/export-D81204584 2025-12-04T08:56:58.9187967Z * [new branch] export-D81429090 -> origin/export-D81429090 2025-12-04T08:56:58.9189590Z * [new branch] export-D82250826 -> origin/export-D82250826 2025-12-04T08:56:58.9191173Z * [new branch] export-D82253817 -> origin/export-D82253817 2025-12-04T08:56:58.9192822Z * [new branch] export-D83541846 -> origin/export-D83541846 2025-12-04T08:56:58.9194447Z * [new branch] export-D83627170 -> origin/export-D83627170 2025-12-04T08:56:58.9196065Z * [new branch] export-D83766701 -> origin/export-D83766701 2025-12-04T08:56:58.9197589Z * [new branch] export-D83768878 -> origin/export-D83768878 2025-12-04T08:56:58.9199281Z * [new branch] export-D83769447 -> origin/export-D83769447 2025-12-04T08:56:58.9200825Z * [new branch] export-D84089824 -> origin/export-D84089824 2025-12-04T08:56:58.9202409Z * [new branch] export-D84213020 -> origin/export-D84213020 2025-12-04T08:56:58.9204268Z * [new branch] export-D84373821 -> origin/export-D84373821 2025-12-04T08:56:58.9206163Z * [new branch] export-D84612194 -> origin/export-D84612194 2025-12-04T08:56:58.9207983Z * [new branch] export-D84890985 -> origin/export-D84890985 2025-12-04T08:56:58.9209480Z * [new branch] export-D85122326 -> origin/export-D85122326 2025-12-04T08:56:58.9211114Z * [new branch] export-D86256198 -> origin/export-D86256198 2025-12-04T08:56:58.9212808Z * [new branch] export-D86460608 -> origin/export-D86460608 2025-12-04T08:56:58.9215029Z * [new branch] export-D86474796 -> origin/export-D86474796 2025-12-04T08:56:58.9216736Z * [new branch] export-D86712396 -> origin/export-D86712396 2025-12-04T08:56:58.9218359Z * [new branch] export-D87022129 -> origin/export-D87022129 2025-12-04T08:56:58.9220077Z * [new branch] export-D87838959 -> origin/export-D87838959 2025-12-04T08:56:58.9221687Z * [new branch] export-D88319437 -> origin/export-D88319437 2025-12-04T08:56:58.9223470Z * [new branch] exported-model-train-idempotent -> origin/exported-model-train-idempotent 2025-12-04T08:56:58.9225047Z * [new branch] ezyang-titan-october -> origin/ezyang-titan-october 2025-12-04T08:56:58.9226617Z * [new branch] ezyang-titan-october2 -> origin/ezyang-titan-october2 2025-12-04T08:56:58.9228160Z * [new branch] ezyang-war -> origin/ezyang-war 2025-12-04T08:56:58.9230298Z * [new branch] ezyang/wip-aot-descriptors -> origin/ezyang/wip-aot-descriptors 2025-12-04T08:56:58.9231860Z * [new branch] fa_u8_brgemm -> origin/fa_u8_brgemm 2025-12-04T08:56:58.9234063Z * [new branch] fadeputr/sequence_fbgemm -> origin/fadeputr/sequence_fbgemm 2025-12-04T08:56:58.9235689Z * [new branch] fastmath_baseline -> origin/fastmath_baseline 2025-12-04T08:56:58.9237898Z * [new branch] fbcode/warm -> origin/fbcode/warm 2025-12-04T08:56:58.9239588Z * [new branch] fca -> origin/fca 2025-12-04T08:56:58.9241167Z * [new branch] fca2_ca5984c -> origin/fca2_ca5984c 2025-12-04T08:56:58.9242774Z * [new branch] fca5 -> origin/fca5 2025-12-04T08:56:58.9244971Z * [new branch] feature/justknobs-cpp -> origin/feature/justknobs-cpp 2025-12-04T08:56:58.9246549Z * [new branch] feature/numa-forkserver -> origin/feature/numa-forkserver 2025-12-04T08:56:58.9248485Z * [new branch] ffast_math_baseline -> origin/ffast_math_baseline 2025-12-04T08:56:58.9250024Z * [new branch] ffast_math_target -> origin/ffast_math_target 2025-12-04T08:56:58.9252100Z * [new branch] findhao/base_commit -> origin/findhao/base_commit 2025-12-04T08:56:58.9254199Z * [new branch] findhao/base_commit1 -> origin/findhao/base_commit1 2025-12-04T08:56:58.9255731Z * [new branch] findhao/multistream2 -> origin/findhao/multistream2 2025-12-04T08:56:58.9257268Z * [new branch] findhao/multistream5 -> origin/findhao/multistream5 2025-12-04T08:56:58.9258722Z * [new branch] findhao/multistream6 -> origin/findhao/multistream6 2025-12-04T08:56:58.9260253Z * [new branch] findhao/operatorbench3 -> origin/findhao/operatorbench3 2025-12-04T08:56:58.9261748Z * [new branch] findhao/operatorbench5 -> origin/findhao/operatorbench5 2025-12-04T08:56:58.9263530Z * [new branch] findhao/tritonparse -> origin/findhao/tritonparse 2025-12-04T08:56:58.9265182Z * [new branch] fix-ck-gemm-template-format -> origin/fix-ck-gemm-template-format 2025-12-04T08:56:58.9266795Z * [new branch] fix-config-ignore -> origin/fix-config-ignore 2025-12-04T08:56:58.9268427Z * [new branch] fix-dict-guard -> origin/fix-dict-guard 2025-12-04T08:56:58.9270303Z * [new branch] fix_addmm_issue -> origin/fix_addmm_issue 2025-12-04T08:56:58.9271741Z * [new branch] fix_amd_missing_cluster_dims -> origin/fix_amd_missing_cluster_dims 2025-12-04T08:56:58.9273392Z * [new branch] fix_bench_bwd_pass -> origin/fix_bench_bwd_pass 2025-12-04T08:56:58.9274972Z * [new branch] fix_mem_profiler_config -> origin/fix_mem_profiler_config 2025-12-04T08:56:58.9276560Z * [new branch] fix_nvrtc_discovery -> origin/fix_nvrtc_discovery 2025-12-04T08:56:58.9278086Z * [new branch] fix_op_runner -> origin/fix_op_runner 2025-12-04T08:56:58.9279714Z * [new branch] fix_ubn_159469 -> origin/fix_ubn_159469 2025-12-04T08:56:58.9281391Z * [new branch] fixes-triage -> origin/fixes-triage 2025-12-04T08:56:58.9282992Z * [new branch] fixflashinfer -> origin/fixflashinfer 2025-12-04T08:56:58.9284550Z * [new branch] flash_decoding_cpu -> origin/flash_decoding_cpu 2025-12-04T08:56:58.9286083Z * [new branch] flex-flash -> origin/flex-flash 2025-12-04T08:56:58.9287759Z * [new branch] flex_attention_functorch_grad -> origin/flex_attention_functorch_grad 2025-12-04T08:56:58.9289348Z * [new branch] flex_flash -> origin/flex_flash 2025-12-04T08:56:58.9291744Z * [new branch] fmassa/fix_memeff_sharding_rule -> origin/fmassa/fix_memeff_sharding_rule 2025-12-04T08:56:58.9293426Z * [new branch] fmassa/tests_comm_compute_scheduler -> origin/fmassa/tests_comm_compute_scheduler 2025-12-04T08:56:58.9294965Z * [new branch] forkserver_fix -> origin/forkserver_fix 2025-12-04T08:56:58.9296546Z * [new branch] fsdp2_trace_rules -> origin/fsdp2_trace_rules 2025-12-04T08:56:58.9298152Z * [new branch] fx_cpp -> origin/fx_cpp 2025-12-04T08:56:58.9300407Z * [new branch] fy/fix-win -> origin/fy/fix-win 2025-12-04T08:56:58.9302041Z * [new branch] galv-patch-1 -> origin/galv-patch-1 2025-12-04T08:56:58.9304285Z * [new branch] galv/cudagraphs-conditional-nodes-4 -> origin/galv/cudagraphs-conditional-nodes-4 2025-12-04T08:56:58.9306426Z * [new branch] georgehong/cmakelists-patch -> origin/georgehong/cmakelists-patch 2025-12-04T08:56:58.9309680Z * [new branch] gh/AlnisM/1/base -> origin/gh/AlnisM/1/base 2025-12-04T08:56:58.9311229Z * [new branch] gh/AlnisM/1/head -> origin/gh/AlnisM/1/head 2025-12-04T08:56:58.9313869Z * [new branch] gh/EikanWang/67/base -> origin/gh/EikanWang/67/base 2025-12-04T08:56:58.9315477Z * [new branch] gh/EikanWang/67/head -> origin/gh/EikanWang/67/head 2025-12-04T08:56:58.9318590Z * [new branch] gh/Gasoonjia/1/base -> origin/gh/Gasoonjia/1/base 2025-12-04T08:56:58.9319981Z * [new branch] gh/Gasoonjia/1/head -> origin/gh/Gasoonjia/1/head 2025-12-04T08:56:58.9322572Z * [new branch] gh/H-Huang/131/base -> origin/gh/H-Huang/131/base 2025-12-04T08:56:58.9324200Z * [new branch] gh/H-Huang/131/head -> origin/gh/H-Huang/131/head 2025-12-04T08:56:58.9325822Z * [new branch] gh/H-Huang/131/orig -> origin/gh/H-Huang/131/orig 2025-12-04T08:56:58.9327881Z * [new branch] gh/H-Huang/132/base -> origin/gh/H-Huang/132/base 2025-12-04T08:56:58.9329420Z * [new branch] gh/H-Huang/132/head -> origin/gh/H-Huang/132/head 2025-12-04T08:56:58.9331014Z * [new branch] gh/H-Huang/132/orig -> origin/gh/H-Huang/132/orig 2025-12-04T08:56:58.9333257Z * [new branch] gh/H-Huang/180/base -> origin/gh/H-Huang/180/base 2025-12-04T08:56:58.9334885Z * [new branch] gh/H-Huang/180/head -> origin/gh/H-Huang/180/head 2025-12-04T08:56:58.9336202Z * [new branch] gh/H-Huang/180/orig -> origin/gh/H-Huang/180/orig 2025-12-04T08:56:58.9338321Z * [new branch] gh/H-Huang/182/base -> origin/gh/H-Huang/182/base 2025-12-04T08:56:58.9339868Z * [new branch] gh/H-Huang/182/head -> origin/gh/H-Huang/182/head 2025-12-04T08:56:58.9341388Z * [new branch] gh/H-Huang/182/orig -> origin/gh/H-Huang/182/orig 2025-12-04T08:56:58.9343552Z * [new branch] gh/H-Huang/226/base -> origin/gh/H-Huang/226/base 2025-12-04T08:56:58.9345287Z * [new branch] gh/H-Huang/226/head -> origin/gh/H-Huang/226/head 2025-12-04T08:56:58.9346879Z * [new branch] gh/H-Huang/226/orig -> origin/gh/H-Huang/226/orig 2025-12-04T08:56:58.9348961Z * [new branch] gh/H-Huang/228/base -> origin/gh/H-Huang/228/base 2025-12-04T08:56:58.9350544Z * [new branch] gh/H-Huang/228/head -> origin/gh/H-Huang/228/head 2025-12-04T08:56:58.9352079Z * [new branch] gh/H-Huang/228/orig -> origin/gh/H-Huang/228/orig 2025-12-04T08:56:58.9354607Z * [new branch] gh/IvanKobzarev/150/base -> origin/gh/IvanKobzarev/150/base 2025-12-04T08:56:58.9356154Z * [new branch] gh/IvanKobzarev/150/head -> origin/gh/IvanKobzarev/150/head 2025-12-04T08:56:58.9357675Z * [new branch] gh/IvanKobzarev/150/orig -> origin/gh/IvanKobzarev/150/orig 2025-12-04T08:56:58.9359865Z * [new branch] gh/IvanKobzarev/157/base -> origin/gh/IvanKobzarev/157/base 2025-12-04T08:56:58.9361443Z * [new branch] gh/IvanKobzarev/157/head -> origin/gh/IvanKobzarev/157/head 2025-12-04T08:56:58.9364168Z * [new branch] gh/IvanKobzarev/157/orig -> origin/gh/IvanKobzarev/157/orig 2025-12-04T08:56:58.9366385Z * [new branch] gh/IvanKobzarev/159/base -> origin/gh/IvanKobzarev/159/base 2025-12-04T08:56:58.9367942Z * [new branch] gh/IvanKobzarev/159/head -> origin/gh/IvanKobzarev/159/head 2025-12-04T08:56:58.9369531Z * [new branch] gh/IvanKobzarev/159/orig -> origin/gh/IvanKobzarev/159/orig 2025-12-04T08:56:58.9371771Z * [new branch] gh/IvanKobzarev/162/base -> origin/gh/IvanKobzarev/162/base 2025-12-04T08:56:58.9373721Z * [new branch] gh/IvanKobzarev/162/head -> origin/gh/IvanKobzarev/162/head 2025-12-04T08:56:58.9375230Z * [new branch] gh/IvanKobzarev/162/orig -> origin/gh/IvanKobzarev/162/orig 2025-12-04T08:56:58.9377384Z * [new branch] gh/IvanKobzarev/163/base -> origin/gh/IvanKobzarev/163/base 2025-12-04T08:56:58.9378959Z * [new branch] gh/IvanKobzarev/163/head -> origin/gh/IvanKobzarev/163/head 2025-12-04T08:56:58.9380484Z * [new branch] gh/IvanKobzarev/163/orig -> origin/gh/IvanKobzarev/163/orig 2025-12-04T08:56:58.9382645Z * [new branch] gh/IvanKobzarev/166/base -> origin/gh/IvanKobzarev/166/base 2025-12-04T08:56:58.9384167Z * [new branch] gh/IvanKobzarev/166/head -> origin/gh/IvanKobzarev/166/head 2025-12-04T08:56:58.9385772Z * [new branch] gh/IvanKobzarev/166/orig -> origin/gh/IvanKobzarev/166/orig 2025-12-04T08:56:58.9387896Z * [new branch] gh/IvanKobzarev/167/base -> origin/gh/IvanKobzarev/167/base 2025-12-04T08:56:58.9389420Z * [new branch] gh/IvanKobzarev/167/head -> origin/gh/IvanKobzarev/167/head 2025-12-04T08:56:58.9390943Z * [new branch] gh/IvanKobzarev/167/orig -> origin/gh/IvanKobzarev/167/orig 2025-12-04T08:56:58.9393101Z * [new branch] gh/IvanKobzarev/168/base -> origin/gh/IvanKobzarev/168/base 2025-12-04T08:56:58.9394702Z * [new branch] gh/IvanKobzarev/168/head -> origin/gh/IvanKobzarev/168/head 2025-12-04T08:56:58.9396405Z * [new branch] gh/IvanKobzarev/168/orig -> origin/gh/IvanKobzarev/168/orig 2025-12-04T08:56:58.9398373Z * [new branch] gh/IvanKobzarev/169/base -> origin/gh/IvanKobzarev/169/base 2025-12-04T08:56:58.9399919Z * [new branch] gh/IvanKobzarev/169/head -> origin/gh/IvanKobzarev/169/head 2025-12-04T08:56:58.9401464Z * [new branch] gh/IvanKobzarev/169/orig -> origin/gh/IvanKobzarev/169/orig 2025-12-04T08:56:58.9403463Z * [new branch] gh/IvanKobzarev/170/base -> origin/gh/IvanKobzarev/170/base 2025-12-04T08:56:58.9405026Z * [new branch] gh/IvanKobzarev/170/head -> origin/gh/IvanKobzarev/170/head 2025-12-04T08:56:58.9406649Z * [new branch] gh/IvanKobzarev/170/orig -> origin/gh/IvanKobzarev/170/orig 2025-12-04T08:56:58.9408870Z * [new branch] gh/IvanKobzarev/171/base -> origin/gh/IvanKobzarev/171/base 2025-12-04T08:56:58.9410420Z * [new branch] gh/IvanKobzarev/171/head -> origin/gh/IvanKobzarev/171/head 2025-12-04T08:56:58.9411993Z * [new branch] gh/IvanKobzarev/171/orig -> origin/gh/IvanKobzarev/171/orig 2025-12-04T08:56:58.9414161Z * [new branch] gh/IvanKobzarev/172/base -> origin/gh/IvanKobzarev/172/base 2025-12-04T08:56:58.9415763Z * [new branch] gh/IvanKobzarev/172/head -> origin/gh/IvanKobzarev/172/head 2025-12-04T08:56:58.9417324Z * [new branch] gh/IvanKobzarev/172/orig -> origin/gh/IvanKobzarev/172/orig 2025-12-04T08:56:58.9419530Z * [new branch] gh/IvanKobzarev/173/base -> origin/gh/IvanKobzarev/173/base 2025-12-04T08:56:58.9421021Z * [new branch] gh/IvanKobzarev/173/head -> origin/gh/IvanKobzarev/173/head 2025-12-04T08:56:58.9422647Z * [new branch] gh/IvanKobzarev/173/orig -> origin/gh/IvanKobzarev/173/orig 2025-12-04T08:56:58.9424854Z * [new branch] gh/IvanKobzarev/174/base -> origin/gh/IvanKobzarev/174/base 2025-12-04T08:56:58.9426445Z * [new branch] gh/IvanKobzarev/174/head -> origin/gh/IvanKobzarev/174/head 2025-12-04T08:56:58.9428058Z * [new branch] gh/IvanKobzarev/174/orig -> origin/gh/IvanKobzarev/174/orig 2025-12-04T08:56:58.9430171Z * [new branch] gh/IvanKobzarev/175/base -> origin/gh/IvanKobzarev/175/base 2025-12-04T08:56:58.9431883Z * [new branch] gh/IvanKobzarev/175/head -> origin/gh/IvanKobzarev/175/head 2025-12-04T08:56:58.9433401Z * [new branch] gh/IvanKobzarev/175/orig -> origin/gh/IvanKobzarev/175/orig 2025-12-04T08:56:58.9435722Z * [new branch] gh/IvanKobzarev/176/base -> origin/gh/IvanKobzarev/176/base 2025-12-04T08:56:58.9437263Z * [new branch] gh/IvanKobzarev/176/head -> origin/gh/IvanKobzarev/176/head 2025-12-04T08:56:58.9438794Z * [new branch] gh/IvanKobzarev/176/orig -> origin/gh/IvanKobzarev/176/orig 2025-12-04T08:56:58.9441225Z * [new branch] gh/IvanKobzarev/177/base -> origin/gh/IvanKobzarev/177/base 2025-12-04T08:56:58.9442854Z * [new branch] gh/IvanKobzarev/177/head -> origin/gh/IvanKobzarev/177/head 2025-12-04T08:56:58.9444431Z * [new branch] gh/IvanKobzarev/177/orig -> origin/gh/IvanKobzarev/177/orig 2025-12-04T08:56:58.9446708Z * [new branch] gh/IvanKobzarev/178/base -> origin/gh/IvanKobzarev/178/base 2025-12-04T08:56:58.9448300Z * [new branch] gh/IvanKobzarev/178/head -> origin/gh/IvanKobzarev/178/head 2025-12-04T08:56:58.9449878Z * [new branch] gh/IvanKobzarev/178/orig -> origin/gh/IvanKobzarev/178/orig 2025-12-04T08:56:58.9452096Z * [new branch] gh/IvanKobzarev/179/base -> origin/gh/IvanKobzarev/179/base 2025-12-04T08:56:58.9453743Z * [new branch] gh/IvanKobzarev/179/head -> origin/gh/IvanKobzarev/179/head 2025-12-04T08:56:58.9455226Z * [new branch] gh/IvanKobzarev/179/orig -> origin/gh/IvanKobzarev/179/orig 2025-12-04T08:56:58.9457618Z * [new branch] gh/IvanKobzarev/180/base -> origin/gh/IvanKobzarev/180/base 2025-12-04T08:56:58.9459112Z * [new branch] gh/IvanKobzarev/180/head -> origin/gh/IvanKobzarev/180/head 2025-12-04T08:56:58.9460640Z * [new branch] gh/IvanKobzarev/180/orig -> origin/gh/IvanKobzarev/180/orig 2025-12-04T08:56:58.9463177Z * [new branch] gh/IvanKobzarev/181/base -> origin/gh/IvanKobzarev/181/base 2025-12-04T08:56:58.9464781Z * [new branch] gh/IvanKobzarev/181/head -> origin/gh/IvanKobzarev/181/head 2025-12-04T08:56:58.9466380Z * [new branch] gh/IvanKobzarev/181/orig -> origin/gh/IvanKobzarev/181/orig 2025-12-04T08:56:58.9468728Z * [new branch] gh/IvanKobzarev/182/base -> origin/gh/IvanKobzarev/182/base 2025-12-04T08:56:58.9470298Z * [new branch] gh/IvanKobzarev/182/head -> origin/gh/IvanKobzarev/182/head 2025-12-04T08:56:58.9471835Z * [new branch] gh/IvanKobzarev/182/orig -> origin/gh/IvanKobzarev/182/orig 2025-12-04T08:56:58.9474177Z * [new branch] gh/IvanKobzarev/183/base -> origin/gh/IvanKobzarev/183/base 2025-12-04T08:56:58.9475782Z * [new branch] gh/IvanKobzarev/183/head -> origin/gh/IvanKobzarev/183/head 2025-12-04T08:56:58.9477335Z * [new branch] gh/IvanKobzarev/183/orig -> origin/gh/IvanKobzarev/183/orig 2025-12-04T08:56:58.9479801Z * [new branch] gh/IvanKobzarev/184/base -> origin/gh/IvanKobzarev/184/base 2025-12-04T08:56:58.9481382Z * [new branch] gh/IvanKobzarev/184/head -> origin/gh/IvanKobzarev/184/head 2025-12-04T08:56:58.9482942Z * [new branch] gh/IvanKobzarev/184/orig -> origin/gh/IvanKobzarev/184/orig 2025-12-04T08:56:58.9485563Z * [new branch] gh/NikhilAPatel/1/base -> origin/gh/NikhilAPatel/1/base 2025-12-04T08:56:58.9487204Z * [new branch] gh/NikhilAPatel/1/head -> origin/gh/NikhilAPatel/1/head 2025-12-04T08:56:58.9489183Z * [new branch] gh/NikhilAPatel/2/base -> origin/gh/NikhilAPatel/2/base 2025-12-04T08:56:58.9490707Z * [new branch] gh/NikhilAPatel/2/head -> origin/gh/NikhilAPatel/2/head 2025-12-04T08:56:58.9492998Z * [new branch] gh/NikhilAPatel/4/base -> origin/gh/NikhilAPatel/4/base 2025-12-04T08:56:58.9494578Z * [new branch] gh/NikhilAPatel/4/head -> origin/gh/NikhilAPatel/4/head 2025-12-04T08:56:58.9511233Z * [new branch] gh/NikhilAPatel/5/base -> origin/gh/NikhilAPatel/5/base 2025-12-04T08:56:58.9511880Z * [new branch] gh/NikhilAPatel/5/head -> origin/gh/NikhilAPatel/5/head 2025-12-04T08:56:58.9512339Z * [new branch] gh/NikhilAPatel/5/orig -> origin/gh/NikhilAPatel/5/orig 2025-12-04T08:56:58.9513341Z * [new branch] gh/PaliC/17/base -> origin/gh/PaliC/17/base 2025-12-04T08:56:58.9513732Z * [new branch] gh/PaliC/17/head -> origin/gh/PaliC/17/head 2025-12-04T08:56:58.9514120Z * [new branch] gh/PaliC/17/orig -> origin/gh/PaliC/17/orig 2025-12-04T08:56:58.9514470Z * [new branch] gh/PaliC/18/base -> origin/gh/PaliC/18/base 2025-12-04T08:56:58.9514823Z * [new branch] gh/PaliC/18/head -> origin/gh/PaliC/18/head 2025-12-04T08:56:58.9515171Z * [new branch] gh/PaliC/18/orig -> origin/gh/PaliC/18/orig 2025-12-04T08:56:58.9515526Z * [new branch] gh/PaliC/20/base -> origin/gh/PaliC/20/base 2025-12-04T08:56:58.9515866Z * [new branch] gh/PaliC/20/head -> origin/gh/PaliC/20/head 2025-12-04T08:56:58.9516222Z * [new branch] gh/PaliC/20/orig -> origin/gh/PaliC/20/orig 2025-12-04T08:56:58.9518885Z * [new branch] gh/PaliC/21/base -> origin/gh/PaliC/21/base 2025-12-04T08:56:58.9520220Z * [new branch] gh/PaliC/21/head -> origin/gh/PaliC/21/head 2025-12-04T08:56:58.9521992Z * [new branch] gh/PaliC/21/orig -> origin/gh/PaliC/21/orig 2025-12-04T08:56:58.9523621Z * [new branch] gh/PaliC/23/base -> origin/gh/PaliC/23/base 2025-12-04T08:56:58.9525305Z * [new branch] gh/PaliC/23/head -> origin/gh/PaliC/23/head 2025-12-04T08:56:58.9526966Z * [new branch] gh/PaliC/23/orig -> origin/gh/PaliC/23/orig 2025-12-04T08:56:58.9528995Z * [new branch] gh/PaliC/24/base -> origin/gh/PaliC/24/base 2025-12-04T08:56:58.9530509Z * [new branch] gh/PaliC/24/head -> origin/gh/PaliC/24/head 2025-12-04T08:56:58.9532036Z * [new branch] gh/PaliC/24/orig -> origin/gh/PaliC/24/orig 2025-12-04T08:56:58.9534336Z * [new branch] gh/PaliC/25/head -> origin/gh/PaliC/25/head 2025-12-04T08:56:58.9535696Z * [new branch] gh/PaliC/25/next -> origin/gh/PaliC/25/next 2025-12-04T08:56:58.9537516Z * [new branch] gh/PaliC/25/orig -> origin/gh/PaliC/25/orig 2025-12-04T08:56:58.9539548Z * [new branch] gh/PaliC/26/head -> origin/gh/PaliC/26/head 2025-12-04T08:56:58.9540845Z * [new branch] gh/PaliC/26/next -> origin/gh/PaliC/26/next 2025-12-04T08:56:58.9542647Z * [new branch] gh/PaliC/26/orig -> origin/gh/PaliC/26/orig 2025-12-04T08:56:58.9544628Z * [new branch] gh/PaliC/27/next -> origin/gh/PaliC/27/next 2025-12-04T08:56:58.9546756Z * [new branch] gh/PaliC/28/head -> origin/gh/PaliC/28/head 2025-12-04T08:56:58.9548037Z * [new branch] gh/PaliC/28/next -> origin/gh/PaliC/28/next 2025-12-04T08:56:58.9549764Z * [new branch] gh/PaliC/28/orig -> origin/gh/PaliC/28/orig 2025-12-04T08:56:58.9551924Z * [new branch] gh/PaliC/29/head -> origin/gh/PaliC/29/head 2025-12-04T08:56:58.9553158Z * [new branch] gh/PaliC/29/next -> origin/gh/PaliC/29/next 2025-12-04T08:56:58.9554835Z * [new branch] gh/PaliC/29/orig -> origin/gh/PaliC/29/orig 2025-12-04T08:56:58.9556894Z * [new branch] gh/PaliC/30/head -> origin/gh/PaliC/30/head 2025-12-04T08:56:58.9558243Z * [new branch] gh/PaliC/30/next -> origin/gh/PaliC/30/next 2025-12-04T08:56:58.9559925Z * [new branch] gh/PaliC/30/orig -> origin/gh/PaliC/30/orig 2025-12-04T08:56:58.9562119Z * [new branch] gh/PaliC/31/head -> origin/gh/PaliC/31/head 2025-12-04T08:56:58.9563815Z * [new branch] gh/PaliC/31/next -> origin/gh/PaliC/31/next 2025-12-04T08:56:58.9565178Z * [new branch] gh/PaliC/31/orig -> origin/gh/PaliC/31/orig 2025-12-04T08:56:58.9567782Z * [new branch] gh/PaulZhang12/25/base -> origin/gh/PaulZhang12/25/base 2025-12-04T08:56:58.9569234Z * [new branch] gh/PaulZhang12/25/head -> origin/gh/PaulZhang12/25/head 2025-12-04T08:56:58.9570724Z * [new branch] gh/PaulZhang12/25/orig -> origin/gh/PaulZhang12/25/orig 2025-12-04T08:56:58.9572921Z * [new branch] gh/PaulZhang12/28/base -> origin/gh/PaulZhang12/28/base 2025-12-04T08:56:58.9574494Z * [new branch] gh/PaulZhang12/28/head -> origin/gh/PaulZhang12/28/head 2025-12-04T08:56:58.9576057Z * [new branch] gh/PaulZhang12/28/orig -> origin/gh/PaulZhang12/28/orig 2025-12-04T08:56:58.9578370Z * [new branch] gh/PaulZhang12/31/base -> origin/gh/PaulZhang12/31/base 2025-12-04T08:56:58.9579902Z * [new branch] gh/PaulZhang12/31/head -> origin/gh/PaulZhang12/31/head 2025-12-04T08:56:58.9581559Z * [new branch] gh/PaulZhang12/31/orig -> origin/gh/PaulZhang12/31/orig 2025-12-04T08:56:58.9584420Z * [new branch] gh/PaulZhang12/37/base -> origin/gh/PaulZhang12/37/base 2025-12-04T08:56:58.9586084Z * [new branch] gh/PaulZhang12/37/head -> origin/gh/PaulZhang12/37/head 2025-12-04T08:56:58.9587246Z * [new branch] gh/PaulZhang12/37/orig -> origin/gh/PaulZhang12/37/orig 2025-12-04T08:56:58.9589412Z * [new branch] gh/PaulZhang12/40/base -> origin/gh/PaulZhang12/40/base 2025-12-04T08:56:58.9590961Z * [new branch] gh/PaulZhang12/40/head -> origin/gh/PaulZhang12/40/head 2025-12-04T08:56:58.9592464Z * [new branch] gh/PaulZhang12/40/orig -> origin/gh/PaulZhang12/40/orig 2025-12-04T08:56:58.9594628Z * [new branch] gh/PaulZhang12/42/base -> origin/gh/PaulZhang12/42/base 2025-12-04T08:56:58.9596233Z * [new branch] gh/PaulZhang12/42/head -> origin/gh/PaulZhang12/42/head 2025-12-04T08:56:58.9598336Z * [new branch] gh/PaulZhang12/43/base -> origin/gh/PaulZhang12/43/base 2025-12-04T08:56:58.9599939Z * [new branch] gh/PaulZhang12/43/head -> origin/gh/PaulZhang12/43/head 2025-12-04T08:56:58.9601511Z * [new branch] gh/PaulZhang12/43/orig -> origin/gh/PaulZhang12/43/orig 2025-12-04T08:56:58.9603491Z * [new branch] gh/PaulZhang12/44/base -> origin/gh/PaulZhang12/44/base 2025-12-04T08:56:58.9605090Z * [new branch] gh/PaulZhang12/44/head -> origin/gh/PaulZhang12/44/head 2025-12-04T08:56:58.9607249Z * [new branch] gh/PaulZhang12/45/base -> origin/gh/PaulZhang12/45/base 2025-12-04T08:56:58.9608755Z * [new branch] gh/PaulZhang12/45/head -> origin/gh/PaulZhang12/45/head 2025-12-04T08:56:58.9610258Z * [new branch] gh/PaulZhang12/45/orig -> origin/gh/PaulZhang12/45/orig 2025-12-04T08:56:58.9612415Z * [new branch] gh/PaulZhang12/46/base -> origin/gh/PaulZhang12/46/base 2025-12-04T08:56:58.9614124Z * [new branch] gh/PaulZhang12/46/head -> origin/gh/PaulZhang12/46/head 2025-12-04T08:56:58.9615698Z * [new branch] gh/PaulZhang12/46/orig -> origin/gh/PaulZhang12/46/orig 2025-12-04T08:56:58.9617823Z * [new branch] gh/PaulZhang12/47/base -> origin/gh/PaulZhang12/47/base 2025-12-04T08:56:58.9619479Z * [new branch] gh/PaulZhang12/47/head -> origin/gh/PaulZhang12/47/head 2025-12-04T08:56:58.9621019Z * [new branch] gh/PaulZhang12/47/orig -> origin/gh/PaulZhang12/47/orig 2025-12-04T08:56:58.9622995Z * [new branch] gh/PaulZhang12/48/base -> origin/gh/PaulZhang12/48/base 2025-12-04T08:56:58.9624527Z * [new branch] gh/PaulZhang12/48/head -> origin/gh/PaulZhang12/48/head 2025-12-04T08:56:58.9626079Z * [new branch] gh/PaulZhang12/48/orig -> origin/gh/PaulZhang12/48/orig 2025-12-04T08:56:58.9628497Z * [new branch] gh/SamGinzburg/11/base -> origin/gh/SamGinzburg/11/base 2025-12-04T08:56:58.9630040Z * [new branch] gh/SamGinzburg/11/head -> origin/gh/SamGinzburg/11/head 2025-12-04T08:56:58.9632696Z * [new branch] gh/SherlockNoMad/1/base -> origin/gh/SherlockNoMad/1/base 2025-12-04T08:56:58.9634226Z * [new branch] gh/SherlockNoMad/1/head -> origin/gh/SherlockNoMad/1/head 2025-12-04T08:56:58.9636394Z * [new branch] gh/SherlockNoMad/10/base -> origin/gh/SherlockNoMad/10/base 2025-12-04T08:56:58.9638042Z * [new branch] gh/SherlockNoMad/10/head -> origin/gh/SherlockNoMad/10/head 2025-12-04T08:56:58.9639614Z * [new branch] gh/SherlockNoMad/10/orig -> origin/gh/SherlockNoMad/10/orig 2025-12-04T08:56:58.9641630Z * [new branch] gh/SherlockNoMad/11/base -> origin/gh/SherlockNoMad/11/base 2025-12-04T08:56:58.9643173Z * [new branch] gh/SherlockNoMad/11/head -> origin/gh/SherlockNoMad/11/head 2025-12-04T08:56:58.9644829Z * [new branch] gh/SherlockNoMad/11/orig -> origin/gh/SherlockNoMad/11/orig 2025-12-04T08:56:58.9646853Z * [new branch] gh/SherlockNoMad/12/base -> origin/gh/SherlockNoMad/12/base 2025-12-04T08:56:58.9648349Z * [new branch] gh/SherlockNoMad/12/head -> origin/gh/SherlockNoMad/12/head 2025-12-04T08:56:58.9649860Z * [new branch] gh/SherlockNoMad/12/orig -> origin/gh/SherlockNoMad/12/orig 2025-12-04T08:56:58.9652021Z * [new branch] gh/SherlockNoMad/15/base -> origin/gh/SherlockNoMad/15/base 2025-12-04T08:56:58.9653743Z * [new branch] gh/SherlockNoMad/15/head -> origin/gh/SherlockNoMad/15/head 2025-12-04T08:56:58.9655261Z * [new branch] gh/SherlockNoMad/15/orig -> origin/gh/SherlockNoMad/15/orig 2025-12-04T08:56:58.9657430Z * [new branch] gh/SherlockNoMad/17/base -> origin/gh/SherlockNoMad/17/base 2025-12-04T08:56:58.9658921Z * [new branch] gh/SherlockNoMad/17/head -> origin/gh/SherlockNoMad/17/head 2025-12-04T08:56:58.9660491Z * [new branch] gh/SherlockNoMad/17/orig -> origin/gh/SherlockNoMad/17/orig 2025-12-04T08:56:58.9662888Z * [new branch] gh/SherlockNoMad/18/base -> origin/gh/SherlockNoMad/18/base 2025-12-04T08:56:58.9664798Z * [new branch] gh/SherlockNoMad/18/head -> origin/gh/SherlockNoMad/18/head 2025-12-04T08:56:58.9666380Z * [new branch] gh/SherlockNoMad/18/orig -> origin/gh/SherlockNoMad/18/orig 2025-12-04T08:56:58.9668530Z * [new branch] gh/SherlockNoMad/19/base -> origin/gh/SherlockNoMad/19/base 2025-12-04T08:56:58.9670160Z * [new branch] gh/SherlockNoMad/19/head -> origin/gh/SherlockNoMad/19/head 2025-12-04T08:56:58.9671758Z * [new branch] gh/SherlockNoMad/19/orig -> origin/gh/SherlockNoMad/19/orig 2025-12-04T08:56:58.9673744Z * [new branch] gh/SherlockNoMad/2/base -> origin/gh/SherlockNoMad/2/base 2025-12-04T08:56:58.9675228Z * [new branch] gh/SherlockNoMad/2/head -> origin/gh/SherlockNoMad/2/head 2025-12-04T08:56:58.9677208Z * [new branch] gh/SherlockNoMad/20/base -> origin/gh/SherlockNoMad/20/base 2025-12-04T08:56:58.9678872Z * [new branch] gh/SherlockNoMad/20/head -> origin/gh/SherlockNoMad/20/head 2025-12-04T08:56:58.9680326Z * [new branch] gh/SherlockNoMad/20/orig -> origin/gh/SherlockNoMad/20/orig 2025-12-04T08:56:58.9682603Z * [new branch] gh/SherlockNoMad/21/base -> origin/gh/SherlockNoMad/21/base 2025-12-04T08:56:58.9684275Z * [new branch] gh/SherlockNoMad/21/head -> origin/gh/SherlockNoMad/21/head 2025-12-04T08:56:58.9685771Z * [new branch] gh/SherlockNoMad/21/orig -> origin/gh/SherlockNoMad/21/orig 2025-12-04T08:56:58.9687749Z * [new branch] gh/SherlockNoMad/3/base -> origin/gh/SherlockNoMad/3/base 2025-12-04T08:56:58.9689258Z * [new branch] gh/SherlockNoMad/3/head -> origin/gh/SherlockNoMad/3/head 2025-12-04T08:56:58.9691247Z * [new branch] gh/SherlockNoMad/4/base -> origin/gh/SherlockNoMad/4/base 2025-12-04T08:56:58.9692874Z * [new branch] gh/SherlockNoMad/4/head -> origin/gh/SherlockNoMad/4/head 2025-12-04T08:56:58.9694844Z * [new branch] gh/SherlockNoMad/5/base -> origin/gh/SherlockNoMad/5/base 2025-12-04T08:56:58.9696357Z * [new branch] gh/SherlockNoMad/5/head -> origin/gh/SherlockNoMad/5/head 2025-12-04T08:56:58.9699402Z * [new branch] gh/Sidharth123-cpu/24/base -> origin/gh/Sidharth123-cpu/24/base 2025-12-04T08:56:58.9701427Z * [new branch] gh/Sidharth123-cpu/25/base -> origin/gh/Sidharth123-cpu/25/base 2025-12-04T08:56:58.9703382Z * [new branch] gh/Sidharth123-cpu/26/base -> origin/gh/Sidharth123-cpu/26/base 2025-12-04T08:56:58.9705627Z * [new branch] gh/Sidharth123-cpu/27/base -> origin/gh/Sidharth123-cpu/27/base 2025-12-04T08:56:58.9708167Z * [new branch] gh/StrongerXi/1/base -> origin/gh/StrongerXi/1/base 2025-12-04T08:56:58.9709938Z * [new branch] gh/StrongerXi/1/head -> origin/gh/StrongerXi/1/head 2025-12-04T08:56:58.9711958Z * [new branch] gh/StrongerXi/71/base -> origin/gh/StrongerXi/71/base 2025-12-04T08:56:58.9713484Z * [new branch] gh/StrongerXi/71/head -> origin/gh/StrongerXi/71/head 2025-12-04T08:56:58.9715478Z * [new branch] gh/StrongerXi/72/base -> origin/gh/StrongerXi/72/base 2025-12-04T08:56:58.9717064Z * [new branch] gh/StrongerXi/72/head -> origin/gh/StrongerXi/72/head 2025-12-04T08:56:58.9719094Z * [new branch] gh/StrongerXi/73/base -> origin/gh/StrongerXi/73/base 2025-12-04T08:56:58.9720644Z * [new branch] gh/StrongerXi/73/head -> origin/gh/StrongerXi/73/head 2025-12-04T08:56:58.9722265Z * [new branch] gh/StrongerXi/73/orig -> origin/gh/StrongerXi/73/orig 2025-12-04T08:56:58.9724958Z * [new branch] gh/XilunWu/160/base -> origin/gh/XilunWu/160/base 2025-12-04T08:56:58.9726412Z * [new branch] gh/XilunWu/160/head -> origin/gh/XilunWu/160/head 2025-12-04T08:56:58.9728083Z * [new branch] gh/XilunWu/160/orig -> origin/gh/XilunWu/160/orig 2025-12-04T08:56:58.9730166Z * [new branch] gh/XilunWu/163/base -> origin/gh/XilunWu/163/base 2025-12-04T08:56:58.9731766Z * [new branch] gh/XilunWu/163/head -> origin/gh/XilunWu/163/head 2025-12-04T08:56:58.9733476Z * [new branch] gh/XilunWu/163/orig -> origin/gh/XilunWu/163/orig 2025-12-04T08:56:58.9735651Z * [new branch] gh/XilunWu/168/base -> origin/gh/XilunWu/168/base 2025-12-04T08:56:58.9737137Z * [new branch] gh/XilunWu/168/head -> origin/gh/XilunWu/168/head 2025-12-04T08:56:58.9738695Z * [new branch] gh/XilunWu/168/orig -> origin/gh/XilunWu/168/orig 2025-12-04T08:56:58.9740901Z * [new branch] gh/XilunWu/169/base -> origin/gh/XilunWu/169/base 2025-12-04T08:56:58.9742439Z * [new branch] gh/XilunWu/169/head -> origin/gh/XilunWu/169/head 2025-12-04T08:56:58.9744007Z * [new branch] gh/XilunWu/169/orig -> origin/gh/XilunWu/169/orig 2025-12-04T08:56:58.9745980Z * [new branch] gh/XilunWu/170/base -> origin/gh/XilunWu/170/base 2025-12-04T08:56:58.9747503Z * [new branch] gh/XilunWu/170/head -> origin/gh/XilunWu/170/head 2025-12-04T08:56:58.9749082Z * [new branch] gh/XilunWu/170/orig -> origin/gh/XilunWu/170/orig 2025-12-04T08:56:58.9751284Z * [new branch] gh/XilunWu/171/base -> origin/gh/XilunWu/171/base 2025-12-04T08:56:58.9752888Z * [new branch] gh/XilunWu/171/head -> origin/gh/XilunWu/171/head 2025-12-04T08:56:58.9754427Z * [new branch] gh/XilunWu/171/orig -> origin/gh/XilunWu/171/orig 2025-12-04T08:56:58.9756462Z * [new branch] gh/XilunWu/173/base -> origin/gh/XilunWu/173/base 2025-12-04T08:56:58.9758073Z * [new branch] gh/XilunWu/173/head -> origin/gh/XilunWu/173/head 2025-12-04T08:56:58.9759646Z * [new branch] gh/XilunWu/173/orig -> origin/gh/XilunWu/173/orig 2025-12-04T08:56:58.9761924Z * [new branch] gh/XilunWu/175/base -> origin/gh/XilunWu/175/base 2025-12-04T08:56:58.9763550Z * [new branch] gh/XilunWu/175/head -> origin/gh/XilunWu/175/head 2025-12-04T08:56:58.9765072Z * [new branch] gh/XilunWu/175/orig -> origin/gh/XilunWu/175/orig 2025-12-04T08:56:58.9767294Z * [new branch] gh/XilunWu/176/base -> origin/gh/XilunWu/176/base 2025-12-04T08:56:58.9768845Z * [new branch] gh/XilunWu/176/head -> origin/gh/XilunWu/176/head 2025-12-04T08:56:58.9770539Z * [new branch] gh/XilunWu/176/orig -> origin/gh/XilunWu/176/orig 2025-12-04T08:56:58.9773426Z * [new branch] gh/XuehaiPan/14/base -> origin/gh/XuehaiPan/14/base 2025-12-04T08:56:58.9774871Z * [new branch] gh/XuehaiPan/14/head -> origin/gh/XuehaiPan/14/head 2025-12-04T08:56:58.9776335Z * [new branch] gh/XuehaiPan/14/orig -> origin/gh/XuehaiPan/14/orig 2025-12-04T08:56:58.9778439Z * [new branch] gh/XuehaiPan/179/base -> origin/gh/XuehaiPan/179/base 2025-12-04T08:56:58.9780056Z * [new branch] gh/XuehaiPan/179/head -> origin/gh/XuehaiPan/179/head 2025-12-04T08:56:58.9781707Z * [new branch] gh/XuehaiPan/179/orig -> origin/gh/XuehaiPan/179/orig 2025-12-04T08:56:58.9783753Z * [new branch] gh/XuehaiPan/249/base -> origin/gh/XuehaiPan/249/base 2025-12-04T08:56:58.9785348Z * [new branch] gh/XuehaiPan/249/head -> origin/gh/XuehaiPan/249/head 2025-12-04T08:56:58.9786962Z * [new branch] gh/XuehaiPan/249/orig -> origin/gh/XuehaiPan/249/orig 2025-12-04T08:56:58.9789091Z * [new branch] gh/XuehaiPan/253/base -> origin/gh/XuehaiPan/253/base 2025-12-04T08:56:58.9790655Z * [new branch] gh/XuehaiPan/253/head -> origin/gh/XuehaiPan/253/head 2025-12-04T08:56:58.9792253Z * [new branch] gh/XuehaiPan/253/orig -> origin/gh/XuehaiPan/253/orig 2025-12-04T08:56:58.9794360Z * [new branch] gh/XuehaiPan/254/base -> origin/gh/XuehaiPan/254/base 2025-12-04T08:56:58.9795910Z * [new branch] gh/XuehaiPan/254/head -> origin/gh/XuehaiPan/254/head 2025-12-04T08:56:58.9797495Z * [new branch] gh/XuehaiPan/254/orig -> origin/gh/XuehaiPan/254/orig 2025-12-04T08:56:58.9799544Z * [new branch] gh/XuehaiPan/255/base -> origin/gh/XuehaiPan/255/base 2025-12-04T08:56:58.9801107Z * [new branch] gh/XuehaiPan/255/head -> origin/gh/XuehaiPan/255/head 2025-12-04T08:56:58.9802665Z * [new branch] gh/XuehaiPan/255/orig -> origin/gh/XuehaiPan/255/orig 2025-12-04T08:56:58.9804707Z * [new branch] gh/XuehaiPan/271/base -> origin/gh/XuehaiPan/271/base 2025-12-04T08:56:58.9806254Z * [new branch] gh/XuehaiPan/271/head -> origin/gh/XuehaiPan/271/head 2025-12-04T08:56:58.9807830Z * [new branch] gh/XuehaiPan/271/orig -> origin/gh/XuehaiPan/271/orig 2025-12-04T08:56:58.9809861Z * [new branch] gh/XuehaiPan/343/base -> origin/gh/XuehaiPan/343/base 2025-12-04T08:56:58.9811391Z * [new branch] gh/XuehaiPan/343/head -> origin/gh/XuehaiPan/343/head 2025-12-04T08:56:58.9813007Z * [new branch] gh/XuehaiPan/343/orig -> origin/gh/XuehaiPan/343/orig 2025-12-04T08:56:58.9815158Z * [new branch] gh/XuehaiPan/347/base -> origin/gh/XuehaiPan/347/base 2025-12-04T08:56:58.9816746Z * [new branch] gh/XuehaiPan/347/head -> origin/gh/XuehaiPan/347/head 2025-12-04T08:56:58.9818302Z * [new branch] gh/XuehaiPan/347/orig -> origin/gh/XuehaiPan/347/orig 2025-12-04T08:56:58.9820402Z * [new branch] gh/XuehaiPan/348/base -> origin/gh/XuehaiPan/348/base 2025-12-04T08:56:58.9822014Z * [new branch] gh/XuehaiPan/348/head -> origin/gh/XuehaiPan/348/head 2025-12-04T08:56:58.9823550Z * [new branch] gh/XuehaiPan/348/orig -> origin/gh/XuehaiPan/348/orig 2025-12-04T08:56:58.9825624Z * [new branch] gh/XuehaiPan/350/base -> origin/gh/XuehaiPan/350/base 2025-12-04T08:56:58.9827153Z * [new branch] gh/XuehaiPan/350/head -> origin/gh/XuehaiPan/350/head 2025-12-04T08:56:58.9828698Z * [new branch] gh/XuehaiPan/350/orig -> origin/gh/XuehaiPan/350/orig 2025-12-04T08:56:58.9830799Z * [new branch] gh/XuehaiPan/365/base -> origin/gh/XuehaiPan/365/base 2025-12-04T08:56:58.9832449Z * [new branch] gh/XuehaiPan/365/head -> origin/gh/XuehaiPan/365/head 2025-12-04T08:56:58.9834161Z * [new branch] gh/XuehaiPan/365/orig -> origin/gh/XuehaiPan/365/orig 2025-12-04T08:56:58.9836208Z * [new branch] gh/XuehaiPan/366/base -> origin/gh/XuehaiPan/366/base 2025-12-04T08:56:58.9837760Z * [new branch] gh/XuehaiPan/366/head -> origin/gh/XuehaiPan/366/head 2025-12-04T08:56:58.9839850Z * [new branch] gh/XuehaiPan/370/base -> origin/gh/XuehaiPan/370/base 2025-12-04T08:56:58.9841415Z * [new branch] gh/XuehaiPan/370/head -> origin/gh/XuehaiPan/370/head 2025-12-04T08:56:58.9842969Z * [new branch] gh/XuehaiPan/370/orig -> origin/gh/XuehaiPan/370/orig 2025-12-04T08:56:58.9845135Z * [new branch] gh/XuehaiPan/390/base -> origin/gh/XuehaiPan/390/base 2025-12-04T08:56:58.9846712Z * [new branch] gh/XuehaiPan/390/head -> origin/gh/XuehaiPan/390/head 2025-12-04T08:56:58.9848439Z * [new branch] gh/XuehaiPan/390/orig -> origin/gh/XuehaiPan/390/orig 2025-12-04T08:56:58.9850538Z * [new branch] gh/XuehaiPan/391/base -> origin/gh/XuehaiPan/391/base 2025-12-04T08:56:58.9852115Z * [new branch] gh/XuehaiPan/391/head -> origin/gh/XuehaiPan/391/head 2025-12-04T08:56:58.9853800Z * [new branch] gh/XuehaiPan/391/orig -> origin/gh/XuehaiPan/391/orig 2025-12-04T08:56:58.9855821Z * [new branch] gh/XuehaiPan/392/base -> origin/gh/XuehaiPan/392/base 2025-12-04T08:56:58.9857441Z * [new branch] gh/XuehaiPan/392/head -> origin/gh/XuehaiPan/392/head 2025-12-04T08:56:58.9858944Z * [new branch] gh/XuehaiPan/392/orig -> origin/gh/XuehaiPan/392/orig 2025-12-04T08:56:58.9861558Z * [new branch] gh/XuehaiPan/394/base -> origin/gh/XuehaiPan/394/base 2025-12-04T08:56:58.9863475Z * [new branch] gh/XuehaiPan/394/head -> origin/gh/XuehaiPan/394/head 2025-12-04T08:56:58.9865159Z * [new branch] gh/XuehaiPan/394/orig -> origin/gh/XuehaiPan/394/orig 2025-12-04T08:56:58.9867467Z * [new branch] gh/XuehaiPan/397/base -> origin/gh/XuehaiPan/397/base 2025-12-04T08:56:58.9869186Z * [new branch] gh/XuehaiPan/397/head -> origin/gh/XuehaiPan/397/head 2025-12-04T08:56:58.9870542Z * [new branch] gh/XuehaiPan/397/orig -> origin/gh/XuehaiPan/397/orig 2025-12-04T08:56:58.9872759Z * [new branch] gh/XuehaiPan/398/base -> origin/gh/XuehaiPan/398/base 2025-12-04T08:56:58.9874212Z * [new branch] gh/XuehaiPan/398/head -> origin/gh/XuehaiPan/398/head 2025-12-04T08:56:58.9875866Z * [new branch] gh/XuehaiPan/398/orig -> origin/gh/XuehaiPan/398/orig 2025-12-04T08:56:58.9877961Z * [new branch] gh/XuehaiPan/399/base -> origin/gh/XuehaiPan/399/base 2025-12-04T08:56:58.9879411Z * [new branch] gh/XuehaiPan/399/head -> origin/gh/XuehaiPan/399/head 2025-12-04T08:56:58.9881254Z * [new branch] gh/XuehaiPan/399/orig -> origin/gh/XuehaiPan/399/orig 2025-12-04T08:56:58.9883323Z * [new branch] gh/XuehaiPan/400/base -> origin/gh/XuehaiPan/400/base 2025-12-04T08:56:58.9884707Z * [new branch] gh/XuehaiPan/400/head -> origin/gh/XuehaiPan/400/head 2025-12-04T08:56:58.9886464Z * [new branch] gh/XuehaiPan/400/orig -> origin/gh/XuehaiPan/400/orig 2025-12-04T08:56:58.9888955Z * [new branch] gh/ZhiweiYan-96/39/base -> origin/gh/ZhiweiYan-96/39/base 2025-12-04T08:56:58.9890766Z * [new branch] gh/ZhiweiYan-96/39/head -> origin/gh/ZhiweiYan-96/39/head 2025-12-04T08:56:58.9892401Z * [new branch] gh/ZhiweiYan-96/39/orig -> origin/gh/ZhiweiYan-96/39/orig 2025-12-04T08:56:58.9894668Z * [new branch] gh/ZhiweiYan-96/44/base -> origin/gh/ZhiweiYan-96/44/base 2025-12-04T08:56:58.9896279Z * [new branch] gh/ZhiweiYan-96/44/head -> origin/gh/ZhiweiYan-96/44/head 2025-12-04T08:56:58.9898332Z * [new branch] gh/ZhiweiYan-96/45/base -> origin/gh/ZhiweiYan-96/45/base 2025-12-04T08:56:58.9899618Z * [new branch] gh/ZhiweiYan-96/45/head -> origin/gh/ZhiweiYan-96/45/head 2025-12-04T08:56:58.9901951Z * [new branch] gh/ZhiweiYan-96/49/base -> origin/gh/ZhiweiYan-96/49/base 2025-12-04T08:56:58.9903398Z * [new branch] gh/ZhiweiYan-96/49/head -> origin/gh/ZhiweiYan-96/49/head 2025-12-04T08:56:58.9905849Z * [new branch] gh/ZhiweiYan-96/62/base -> origin/gh/ZhiweiYan-96/62/base 2025-12-04T08:56:58.9907001Z * [new branch] gh/ZhiweiYan-96/62/head -> origin/gh/ZhiweiYan-96/62/head 2025-12-04T08:56:58.9909283Z * [new branch] gh/ZhiweiYan-96/66/base -> origin/gh/ZhiweiYan-96/66/base 2025-12-04T08:56:58.9910674Z * [new branch] gh/ZhiweiYan-96/66/head -> origin/gh/ZhiweiYan-96/66/head 2025-12-04T08:56:58.9912893Z * [new branch] gh/ZhiweiYan-96/67/base -> origin/gh/ZhiweiYan-96/67/base 2025-12-04T08:56:58.9914234Z * [new branch] gh/ZhiweiYan-96/67/head -> origin/gh/ZhiweiYan-96/67/head 2025-12-04T08:56:58.9916388Z * [new branch] gh/ZhiweiYan-96/68/base -> origin/gh/ZhiweiYan-96/68/base 2025-12-04T08:56:58.9917747Z * [new branch] gh/ZhiweiYan-96/68/head -> origin/gh/ZhiweiYan-96/68/head 2025-12-04T08:56:58.9919386Z * [new branch] gh/ZhiweiYan-96/68/orig -> origin/gh/ZhiweiYan-96/68/orig 2025-12-04T08:56:58.9921968Z * [new branch] gh/aakhundov/1/base -> origin/gh/aakhundov/1/base 2025-12-04T08:56:58.9923645Z * [new branch] gh/aakhundov/1/head -> origin/gh/aakhundov/1/head 2025-12-04T08:56:58.9925627Z * [new branch] gh/aakhundov/2/base -> origin/gh/aakhundov/2/base 2025-12-04T08:56:58.9927046Z * [new branch] gh/aakhundov/2/head -> origin/gh/aakhundov/2/head 2025-12-04T08:56:58.9929263Z * [new branch] gh/aditew01/openblas -> origin/gh/aditew01/openblas 2025-12-04T08:56:58.9930865Z * [new branch] gh/aditew01/sbgemm -> origin/gh/aditew01/sbgemm 2025-12-04T08:56:58.9932209Z * [new branch] gh/aditew01/vecbf16 -> origin/gh/aditew01/vecbf16 2025-12-04T08:56:58.9935473Z * [new branch] gh/albanD/4/base -> origin/gh/albanD/4/base 2025-12-04T08:56:58.9937059Z * [new branch] gh/albanD/4/head -> origin/gh/albanD/4/head 2025-12-04T08:56:58.9938686Z * [new branch] gh/albanD/4/orig -> origin/gh/albanD/4/orig 2025-12-04T08:56:58.9941021Z * [new branch] gh/alexbrauckmann/paddedtensor_faketensor_init -> origin/gh/alexbrauckmann/paddedtensor_faketensor_init 2025-12-04T08:56:58.9943337Z * [new branch] gh/alexsamardzic/12/base -> origin/gh/alexsamardzic/12/base 2025-12-04T08:56:58.9944762Z * [new branch] gh/alexsamardzic/12/head -> origin/gh/alexsamardzic/12/head 2025-12-04T08:56:58.9946395Z * [new branch] gh/alexsamardzic/12/orig -> origin/gh/alexsamardzic/12/orig 2025-12-04T08:56:58.9948550Z * [new branch] gh/alexsamardzic/14/base -> origin/gh/alexsamardzic/14/base 2025-12-04T08:56:58.9949950Z * [new branch] gh/alexsamardzic/14/head -> origin/gh/alexsamardzic/14/head 2025-12-04T08:56:58.9951604Z * [new branch] gh/alexsamardzic/14/orig -> origin/gh/alexsamardzic/14/orig 2025-12-04T08:56:58.9953781Z * [new branch] gh/alexsamardzic/15/base -> origin/gh/alexsamardzic/15/base 2025-12-04T08:56:58.9955142Z * [new branch] gh/alexsamardzic/15/head -> origin/gh/alexsamardzic/15/head 2025-12-04T08:56:58.9956891Z * [new branch] gh/alexsamardzic/15/orig -> origin/gh/alexsamardzic/15/orig 2025-12-04T08:56:58.9959599Z * [new branch] gh/amjames/18/base -> origin/gh/amjames/18/base 2025-12-04T08:56:58.9960967Z * [new branch] gh/amjames/18/head -> origin/gh/amjames/18/head 2025-12-04T08:56:58.9962700Z * [new branch] gh/amjames/18/orig -> origin/gh/amjames/18/orig 2025-12-04T08:56:58.9965582Z * [new branch] gh/andrewor14/35/base -> origin/gh/andrewor14/35/base 2025-12-04T08:56:58.9967323Z * [new branch] gh/andrewor14/35/head -> origin/gh/andrewor14/35/head 2025-12-04T08:56:58.9968688Z * [new branch] gh/andrewor14/35/orig -> origin/gh/andrewor14/35/orig 2025-12-04T08:56:58.9971097Z * [new branch] gh/andrewor14/50/base -> origin/gh/andrewor14/50/base 2025-12-04T08:56:58.9972804Z * [new branch] gh/andrewor14/50/head -> origin/gh/andrewor14/50/head 2025-12-04T08:56:58.9974510Z * [new branch] gh/andrewor14/50/orig -> origin/gh/andrewor14/50/orig 2025-12-04T08:56:58.9977042Z * [new branch] gh/andyanwang/30/base -> origin/gh/andyanwang/30/base 2025-12-04T08:56:58.9978781Z * [new branch] gh/andyanwang/30/orig -> origin/gh/andyanwang/30/orig 2025-12-04T08:56:58.9980942Z * [new branch] gh/andyanwang/31/base -> origin/gh/andyanwang/31/base 2025-12-04T08:56:58.9982628Z * [new branch] gh/andyanwang/31/orig -> origin/gh/andyanwang/31/orig 2025-12-04T08:56:58.9984722Z * [new branch] gh/andyanwang/39/base -> origin/gh/andyanwang/39/base 2025-12-04T08:56:58.9986385Z * [new branch] gh/andyanwang/39/head -> origin/gh/andyanwang/39/head 2025-12-04T08:56:58.9987808Z * [new branch] gh/andyanwang/39/orig -> origin/gh/andyanwang/39/orig 2025-12-04T08:56:58.9990189Z * [new branch] gh/andyanwang/42/base -> origin/gh/andyanwang/42/base 2025-12-04T08:56:58.9991815Z * [new branch] gh/andyanwang/42/head -> origin/gh/andyanwang/42/head 2025-12-04T08:56:58.9993518Z * [new branch] gh/andyanwang/42/orig -> origin/gh/andyanwang/42/orig 2025-12-04T08:56:58.9995563Z * [new branch] gh/andyanwang/45/base -> origin/gh/andyanwang/45/base 2025-12-04T08:56:58.9997267Z * [new branch] gh/andyanwang/45/head -> origin/gh/andyanwang/45/head 2025-12-04T08:56:58.9998656Z * [new branch] gh/andyanwang/45/orig -> origin/gh/andyanwang/45/orig 2025-12-04T08:56:59.0001305Z * [new branch] gh/angelayi/107/base -> origin/gh/angelayi/107/base 2025-12-04T08:56:59.0002797Z * [new branch] gh/angelayi/107/head -> origin/gh/angelayi/107/head 2025-12-04T08:56:59.0004940Z * [new branch] gh/angelayi/114/base -> origin/gh/angelayi/114/base 2025-12-04T08:56:59.0006641Z * [new branch] gh/angelayi/114/head -> origin/gh/angelayi/114/head 2025-12-04T08:56:59.0008026Z * [new branch] gh/angelayi/114/orig -> origin/gh/angelayi/114/orig 2025-12-04T08:56:59.0010296Z * [new branch] gh/angelayi/116/base -> origin/gh/angelayi/116/base 2025-12-04T08:56:59.0011690Z * [new branch] gh/angelayi/116/head -> origin/gh/angelayi/116/head 2025-12-04T08:56:59.0013491Z * [new branch] gh/angelayi/116/orig -> origin/gh/angelayi/116/orig 2025-12-04T08:56:59.0015627Z * [new branch] gh/angelayi/122/base -> origin/gh/angelayi/122/base 2025-12-04T08:56:59.0016961Z * [new branch] gh/angelayi/122/head -> origin/gh/angelayi/122/head 2025-12-04T08:56:59.0018708Z * [new branch] gh/angelayi/122/orig -> origin/gh/angelayi/122/orig 2025-12-04T08:56:59.0020855Z * [new branch] gh/angelayi/124/base -> origin/gh/angelayi/124/base 2025-12-04T08:56:59.0022209Z * [new branch] gh/angelayi/124/head -> origin/gh/angelayi/124/head 2025-12-04T08:56:59.0024093Z * [new branch] gh/angelayi/124/orig -> origin/gh/angelayi/124/orig 2025-12-04T08:56:59.0026113Z * [new branch] gh/angelayi/128/base -> origin/gh/angelayi/128/base 2025-12-04T08:56:59.0027519Z * [new branch] gh/angelayi/128/head -> origin/gh/angelayi/128/head 2025-12-04T08:56:59.0029186Z * [new branch] gh/angelayi/128/orig -> origin/gh/angelayi/128/orig 2025-12-04T08:56:59.0031227Z * [new branch] gh/angelayi/131/base -> origin/gh/angelayi/131/base 2025-12-04T08:56:59.0032796Z * [new branch] gh/angelayi/131/head -> origin/gh/angelayi/131/head 2025-12-04T08:56:59.0034388Z * [new branch] gh/angelayi/131/orig -> origin/gh/angelayi/131/orig 2025-12-04T08:56:59.0036719Z * [new branch] gh/angelayi/132/base -> origin/gh/angelayi/132/base 2025-12-04T08:56:59.0038639Z * [new branch] gh/angelayi/132/head -> origin/gh/angelayi/132/head 2025-12-04T08:56:59.0040438Z * [new branch] gh/angelayi/132/orig -> origin/gh/angelayi/132/orig 2025-12-04T08:56:59.0042411Z * [new branch] gh/angelayi/133/base -> origin/gh/angelayi/133/base 2025-12-04T08:56:59.0043983Z * [new branch] gh/angelayi/133/head -> origin/gh/angelayi/133/head 2025-12-04T08:56:59.0045524Z * [new branch] gh/angelayi/133/orig -> origin/gh/angelayi/133/orig 2025-12-04T08:56:59.0047871Z * [new branch] gh/angelayi/134/base -> origin/gh/angelayi/134/base 2025-12-04T08:56:59.0049729Z * [new branch] gh/angelayi/134/head -> origin/gh/angelayi/134/head 2025-12-04T08:56:59.0051096Z * [new branch] gh/angelayi/134/orig -> origin/gh/angelayi/134/orig 2025-12-04T08:56:59.0053620Z * [new branch] gh/angelayi/135/base -> origin/gh/angelayi/135/base 2025-12-04T08:56:59.0055276Z * [new branch] gh/angelayi/135/head -> origin/gh/angelayi/135/head 2025-12-04T08:56:59.0056659Z * [new branch] gh/angelayi/135/orig -> origin/gh/angelayi/135/orig 2025-12-04T08:56:59.0058825Z * [new branch] gh/angelayi/136/base -> origin/gh/angelayi/136/base 2025-12-04T08:56:59.0060279Z * [new branch] gh/angelayi/136/head -> origin/gh/angelayi/136/head 2025-12-04T08:56:59.0062290Z * [new branch] gh/angelayi/136/orig -> origin/gh/angelayi/136/orig 2025-12-04T08:56:59.0065157Z * [new branch] gh/angelayi/137/base -> origin/gh/angelayi/137/base 2025-12-04T08:56:59.0066570Z * [new branch] gh/angelayi/137/head -> origin/gh/angelayi/137/head 2025-12-04T08:56:59.0068395Z * [new branch] gh/angelayi/137/orig -> origin/gh/angelayi/137/orig 2025-12-04T08:56:59.0070387Z * [new branch] gh/angelayi/138/base -> origin/gh/angelayi/138/base 2025-12-04T08:56:59.0071727Z * [new branch] gh/angelayi/138/head -> origin/gh/angelayi/138/head 2025-12-04T08:56:59.0073468Z * [new branch] gh/angelayi/138/orig -> origin/gh/angelayi/138/orig 2025-12-04T08:56:59.0075479Z * [new branch] gh/angelayi/139/base -> origin/gh/angelayi/139/base 2025-12-04T08:56:59.0077075Z * [new branch] gh/angelayi/139/head -> origin/gh/angelayi/139/head 2025-12-04T08:56:59.0078761Z * [new branch] gh/angelayi/139/orig -> origin/gh/angelayi/139/orig 2025-12-04T08:56:59.0080831Z * [new branch] gh/angelayi/140/base -> origin/gh/angelayi/140/base 2025-12-04T08:56:59.0082517Z * [new branch] gh/angelayi/140/head -> origin/gh/angelayi/140/head 2025-12-04T08:56:59.0084147Z * [new branch] gh/angelayi/140/orig -> origin/gh/angelayi/140/orig 2025-12-04T08:56:59.0086592Z * [new branch] gh/angelayi/141/base -> origin/gh/angelayi/141/base 2025-12-04T08:56:59.0088331Z * [new branch] gh/angelayi/141/head -> origin/gh/angelayi/141/head 2025-12-04T08:56:59.0089712Z * [new branch] gh/angelayi/141/orig -> origin/gh/angelayi/141/orig 2025-12-04T08:56:59.0091987Z * [new branch] gh/angelayi/142/base -> origin/gh/angelayi/142/base 2025-12-04T08:56:59.0093647Z * [new branch] gh/angelayi/142/head -> origin/gh/angelayi/142/head 2025-12-04T08:56:59.0095077Z * [new branch] gh/angelayi/142/orig -> origin/gh/angelayi/142/orig 2025-12-04T08:56:59.0097364Z * [new branch] gh/angelayi/143/base -> origin/gh/angelayi/143/base 2025-12-04T08:56:59.0098723Z * [new branch] gh/angelayi/143/head -> origin/gh/angelayi/143/head 2025-12-04T08:56:59.0100407Z * [new branch] gh/angelayi/143/orig -> origin/gh/angelayi/143/orig 2025-12-04T08:56:59.0102541Z * [new branch] gh/angelayi/144/base -> origin/gh/angelayi/144/base 2025-12-04T08:56:59.0104237Z * [new branch] gh/angelayi/144/head -> origin/gh/angelayi/144/head 2025-12-04T08:56:59.0105914Z * [new branch] gh/angelayi/144/orig -> origin/gh/angelayi/144/orig 2025-12-04T08:56:59.0108556Z * [new branch] gh/anijain2305/753/base -> origin/gh/anijain2305/753/base 2025-12-04T08:56:59.0110156Z * [new branch] gh/anijain2305/753/head -> origin/gh/anijain2305/753/head 2025-12-04T08:56:59.0111567Z * [new branch] gh/anijain2305/753/orig -> origin/gh/anijain2305/753/orig 2025-12-04T08:56:59.0113948Z * [new branch] gh/anijain2305/810/base -> origin/gh/anijain2305/810/base 2025-12-04T08:56:59.0115369Z * [new branch] gh/anijain2305/810/head -> origin/gh/anijain2305/810/head 2025-12-04T08:56:59.0117079Z * [new branch] gh/anijain2305/810/orig -> origin/gh/anijain2305/810/orig 2025-12-04T08:56:59.0119141Z * [new branch] gh/anijain2305/854/base -> origin/gh/anijain2305/854/base 2025-12-04T08:56:59.0120749Z * [new branch] gh/anijain2305/854/head -> origin/gh/anijain2305/854/head 2025-12-04T08:56:59.0122352Z * [new branch] gh/anijain2305/854/orig -> origin/gh/anijain2305/854/orig 2025-12-04T08:56:59.0124552Z * [new branch] gh/anijain2305/864/base -> origin/gh/anijain2305/864/base 2025-12-04T08:56:59.0125984Z * [new branch] gh/anijain2305/864/head -> origin/gh/anijain2305/864/head 2025-12-04T08:56:59.0127571Z * [new branch] gh/anijain2305/864/orig -> origin/gh/anijain2305/864/orig 2025-12-04T08:56:59.0129803Z * [new branch] gh/anijain2305/870/base -> origin/gh/anijain2305/870/base 2025-12-04T08:56:59.0131197Z * [new branch] gh/anijain2305/870/head -> origin/gh/anijain2305/870/head 2025-12-04T08:56:59.0133017Z * [new branch] gh/anijain2305/870/orig -> origin/gh/anijain2305/870/orig 2025-12-04T08:56:59.0135244Z * [new branch] gh/anijain2305/873/base -> origin/gh/anijain2305/873/base 2025-12-04T08:56:59.0137324Z * [new branch] gh/anijain2305/873/head -> origin/gh/anijain2305/873/head 2025-12-04T08:56:59.0138813Z * [new branch] gh/anijain2305/873/orig -> origin/gh/anijain2305/873/orig 2025-12-04T08:56:59.0141125Z * [new branch] gh/anijain2305/894/base -> origin/gh/anijain2305/894/base 2025-12-04T08:56:59.0142565Z * [new branch] gh/anijain2305/894/head -> origin/gh/anijain2305/894/head 2025-12-04T08:56:59.0144287Z * [new branch] gh/anijain2305/894/orig -> origin/gh/anijain2305/894/orig 2025-12-04T08:56:59.0146405Z * [new branch] gh/anijain2305/895/base -> origin/gh/anijain2305/895/base 2025-12-04T08:56:59.0147799Z * [new branch] gh/anijain2305/895/head -> origin/gh/anijain2305/895/head 2025-12-04T08:56:59.0149263Z * [new branch] gh/anijain2305/895/orig -> origin/gh/anijain2305/895/orig 2025-12-04T08:56:59.0151483Z * [new branch] gh/anijain2305/910/base -> origin/gh/anijain2305/910/base 2025-12-04T08:56:59.0152986Z * [new branch] gh/anijain2305/910/head -> origin/gh/anijain2305/910/head 2025-12-04T08:56:59.0154530Z * [new branch] gh/anijain2305/910/orig -> origin/gh/anijain2305/910/orig 2025-12-04T08:56:59.0156731Z * [new branch] gh/anijain2305/919/base -> origin/gh/anijain2305/919/base 2025-12-04T08:56:59.0158323Z * [new branch] gh/anijain2305/919/head -> origin/gh/anijain2305/919/head 2025-12-04T08:56:59.0159879Z * [new branch] gh/anijain2305/919/orig -> origin/gh/anijain2305/919/orig 2025-12-04T08:56:59.0162228Z * [new branch] gh/anijain2305/922/base -> origin/gh/anijain2305/922/base 2025-12-04T08:56:59.0163968Z * [new branch] gh/anijain2305/922/head -> origin/gh/anijain2305/922/head 2025-12-04T08:56:59.0165539Z * [new branch] gh/anijain2305/922/orig -> origin/gh/anijain2305/922/orig 2025-12-04T08:56:59.0167656Z * [new branch] gh/anijain2305/932/base -> origin/gh/anijain2305/932/base 2025-12-04T08:56:59.0169287Z * [new branch] gh/anijain2305/932/head -> origin/gh/anijain2305/932/head 2025-12-04T08:56:59.0170932Z * [new branch] gh/anijain2305/932/orig -> origin/gh/anijain2305/932/orig 2025-12-04T08:56:59.0173262Z * [new branch] gh/anijain2305/940/base -> origin/gh/anijain2305/940/base 2025-12-04T08:56:59.0174775Z * [new branch] gh/anijain2305/940/head -> origin/gh/anijain2305/940/head 2025-12-04T08:56:59.0176323Z * [new branch] gh/anijain2305/940/orig -> origin/gh/anijain2305/940/orig 2025-12-04T08:56:59.0178459Z * [new branch] gh/anijain2305/941/base -> origin/gh/anijain2305/941/base 2025-12-04T08:56:59.0180029Z * [new branch] gh/anijain2305/941/head -> origin/gh/anijain2305/941/head 2025-12-04T08:56:59.0181605Z * [new branch] gh/anijain2305/941/orig -> origin/gh/anijain2305/941/orig 2025-12-04T08:56:59.0183776Z * [new branch] gh/anijain2305/942/base -> origin/gh/anijain2305/942/base 2025-12-04T08:56:59.0185360Z * [new branch] gh/anijain2305/942/head -> origin/gh/anijain2305/942/head 2025-12-04T08:56:59.0186968Z * [new branch] gh/anijain2305/942/orig -> origin/gh/anijain2305/942/orig 2025-12-04T08:56:59.0189086Z * [new branch] gh/anijain2305/943/base -> origin/gh/anijain2305/943/base 2025-12-04T08:56:59.0190688Z * [new branch] gh/anijain2305/943/head -> origin/gh/anijain2305/943/head 2025-12-04T08:56:59.0192305Z * [new branch] gh/anijain2305/943/orig -> origin/gh/anijain2305/943/orig 2025-12-04T08:56:59.0194943Z * [new branch] gh/anijain2305/944/base -> origin/gh/anijain2305/944/base 2025-12-04T08:56:59.0196506Z * [new branch] gh/anijain2305/944/head -> origin/gh/anijain2305/944/head 2025-12-04T08:56:59.0198083Z * [new branch] gh/anijain2305/944/orig -> origin/gh/anijain2305/944/orig 2025-12-04T08:56:59.0200686Z * [new branch] gh/anijain2305/945/base -> origin/gh/anijain2305/945/base 2025-12-04T08:56:59.0202299Z * [new branch] gh/anijain2305/945/head -> origin/gh/anijain2305/945/head 2025-12-04T08:56:59.0203815Z * [new branch] gh/anijain2305/945/orig -> origin/gh/anijain2305/945/orig 2025-12-04T08:56:59.0206051Z * [new branch] gh/anijain2305/946/base -> origin/gh/anijain2305/946/base 2025-12-04T08:56:59.0207608Z * [new branch] gh/anijain2305/946/head -> origin/gh/anijain2305/946/head 2025-12-04T08:56:59.0209171Z * [new branch] gh/anijain2305/946/orig -> origin/gh/anijain2305/946/orig 2025-12-04T08:56:59.0211345Z * [new branch] gh/anijain2305/947/base -> origin/gh/anijain2305/947/base 2025-12-04T08:56:59.0213165Z * [new branch] gh/anijain2305/947/head -> origin/gh/anijain2305/947/head 2025-12-04T08:56:59.0214803Z * [new branch] gh/anijain2305/947/orig -> origin/gh/anijain2305/947/orig 2025-12-04T08:56:59.0217346Z * [new branch] gh/anijain2305/948/base -> origin/gh/anijain2305/948/base 2025-12-04T08:56:59.0218952Z * [new branch] gh/anijain2305/948/head -> origin/gh/anijain2305/948/head 2025-12-04T08:56:59.0220527Z * [new branch] gh/anijain2305/948/orig -> origin/gh/anijain2305/948/orig 2025-12-04T08:56:59.0222648Z * [new branch] gh/anijain2305/949/base -> origin/gh/anijain2305/949/base 2025-12-04T08:56:59.0224206Z * [new branch] gh/anijain2305/949/head -> origin/gh/anijain2305/949/head 2025-12-04T08:56:59.0226019Z * [new branch] gh/anijain2305/949/orig -> origin/gh/anijain2305/949/orig 2025-12-04T08:56:59.0228280Z * [new branch] gh/anijain2305/950/base -> origin/gh/anijain2305/950/base 2025-12-04T08:56:59.0229883Z * [new branch] gh/anijain2305/950/head -> origin/gh/anijain2305/950/head 2025-12-04T08:56:59.0231419Z * [new branch] gh/anijain2305/950/orig -> origin/gh/anijain2305/950/orig 2025-12-04T08:56:59.0233678Z * [new branch] gh/anijain2305/951/base -> origin/gh/anijain2305/951/base 2025-12-04T08:56:59.0235211Z * [new branch] gh/anijain2305/951/head -> origin/gh/anijain2305/951/head 2025-12-04T08:56:59.0236811Z * [new branch] gh/anijain2305/951/orig -> origin/gh/anijain2305/951/orig 2025-12-04T08:56:59.0239031Z * [new branch] gh/anijain2305/952/base -> origin/gh/anijain2305/952/base 2025-12-04T08:56:59.0240592Z * [new branch] gh/anijain2305/952/head -> origin/gh/anijain2305/952/head 2025-12-04T08:56:59.0242126Z * [new branch] gh/anijain2305/952/orig -> origin/gh/anijain2305/952/orig 2025-12-04T08:56:59.0244301Z * [new branch] gh/anijain2305/953/base -> origin/gh/anijain2305/953/base 2025-12-04T08:56:59.0245852Z * [new branch] gh/anijain2305/953/head -> origin/gh/anijain2305/953/head 2025-12-04T08:56:59.0247465Z * [new branch] gh/anijain2305/953/orig -> origin/gh/anijain2305/953/orig 2025-12-04T08:56:59.0249702Z * [new branch] gh/anijain2305/954/base -> origin/gh/anijain2305/954/base 2025-12-04T08:56:59.0251552Z * [new branch] gh/anijain2305/954/head -> origin/gh/anijain2305/954/head 2025-12-04T08:56:59.0253035Z * [new branch] gh/anijain2305/954/orig -> origin/gh/anijain2305/954/orig 2025-12-04T08:56:59.0255233Z * [new branch] gh/anijain2305/955/base -> origin/gh/anijain2305/955/base 2025-12-04T08:56:59.0256825Z * [new branch] gh/anijain2305/955/head -> origin/gh/anijain2305/955/head 2025-12-04T08:56:59.0258359Z * [new branch] gh/anijain2305/955/orig -> origin/gh/anijain2305/955/orig 2025-12-04T08:56:59.0260577Z * [new branch] gh/anijain2305/956/base -> origin/gh/anijain2305/956/base 2025-12-04T08:56:59.0262238Z * [new branch] gh/anijain2305/956/head -> origin/gh/anijain2305/956/head 2025-12-04T08:56:59.0263936Z * [new branch] gh/anijain2305/956/orig -> origin/gh/anijain2305/956/orig 2025-12-04T08:56:59.0266197Z * [new branch] gh/anijain2305/957/base -> origin/gh/anijain2305/957/base 2025-12-04T08:56:59.0267794Z * [new branch] gh/anijain2305/957/head -> origin/gh/anijain2305/957/head 2025-12-04T08:56:59.0269383Z * [new branch] gh/anijain2305/957/orig -> origin/gh/anijain2305/957/orig 2025-12-04T08:56:59.0271647Z * [new branch] gh/anijain2305/958/base -> origin/gh/anijain2305/958/base 2025-12-04T08:56:59.0273319Z * [new branch] gh/anijain2305/958/head -> origin/gh/anijain2305/958/head 2025-12-04T08:56:59.0275142Z * [new branch] gh/anijain2305/958/orig -> origin/gh/anijain2305/958/orig 2025-12-04T08:56:59.0277195Z * [new branch] gh/anijain2305/959/base -> origin/gh/anijain2305/959/base 2025-12-04T08:56:59.0278745Z * [new branch] gh/anijain2305/959/head -> origin/gh/anijain2305/959/head 2025-12-04T08:56:59.0280802Z * [new branch] gh/anijain2305/959/orig -> origin/gh/anijain2305/959/orig 2025-12-04T08:56:59.0283029Z * [new branch] gh/anijain2305/960/base -> origin/gh/anijain2305/960/base 2025-12-04T08:56:59.0284623Z * [new branch] gh/anijain2305/960/head -> origin/gh/anijain2305/960/head 2025-12-04T08:56:59.0286199Z * [new branch] gh/anijain2305/960/orig -> origin/gh/anijain2305/960/orig 2025-12-04T08:56:59.0288384Z * [new branch] gh/anijain2305/961/base -> origin/gh/anijain2305/961/base 2025-12-04T08:56:59.0289989Z * [new branch] gh/anijain2305/961/head -> origin/gh/anijain2305/961/head 2025-12-04T08:56:59.0291571Z * [new branch] gh/anijain2305/961/orig -> origin/gh/anijain2305/961/orig 2025-12-04T08:56:59.0294094Z * [new branch] gh/anijain2305/962/base -> origin/gh/anijain2305/962/base 2025-12-04T08:56:59.0295621Z * [new branch] gh/anijain2305/962/head -> origin/gh/anijain2305/962/head 2025-12-04T08:56:59.0297143Z * [new branch] gh/anijain2305/962/orig -> origin/gh/anijain2305/962/orig 2025-12-04T08:56:59.0299635Z * [new branch] gh/anijain2305/963/base -> origin/gh/anijain2305/963/base 2025-12-04T08:56:59.0301313Z * [new branch] gh/anijain2305/963/head -> origin/gh/anijain2305/963/head 2025-12-04T08:56:59.0302924Z * [new branch] gh/anijain2305/963/orig -> origin/gh/anijain2305/963/orig 2025-12-04T08:56:59.0305227Z * [new branch] gh/anijain2305/964/base -> origin/gh/anijain2305/964/base 2025-12-04T08:56:59.0306858Z * [new branch] gh/anijain2305/964/head -> origin/gh/anijain2305/964/head 2025-12-04T08:56:59.0308364Z * [new branch] gh/anijain2305/964/orig -> origin/gh/anijain2305/964/orig 2025-12-04T08:56:59.0310509Z * [new branch] gh/anijain2305/965/base -> origin/gh/anijain2305/965/base 2025-12-04T08:56:59.0312072Z * [new branch] gh/anijain2305/965/head -> origin/gh/anijain2305/965/head 2025-12-04T08:56:59.0313672Z * [new branch] gh/anijain2305/965/orig -> origin/gh/anijain2305/965/orig 2025-12-04T08:56:59.0316679Z * [new branch] gh/anijain2305/966/base -> origin/gh/anijain2305/966/base 2025-12-04T08:56:59.0319036Z * [new branch] gh/anijain2305/966/head -> origin/gh/anijain2305/966/head 2025-12-04T08:56:59.0321502Z * [new branch] gh/anijain2305/966/orig -> origin/gh/anijain2305/966/orig 2025-12-04T08:56:59.0324867Z * [new branch] gh/anijain2305/967/base -> origin/gh/anijain2305/967/base 2025-12-04T08:56:59.0327217Z * [new branch] gh/anijain2305/967/head -> origin/gh/anijain2305/967/head 2025-12-04T08:56:59.0329549Z * [new branch] gh/anijain2305/967/orig -> origin/gh/anijain2305/967/orig 2025-12-04T08:56:59.0332576Z * [new branch] gh/anijain2305/968/base -> origin/gh/anijain2305/968/base 2025-12-04T08:56:59.0334883Z * [new branch] gh/anijain2305/968/head -> origin/gh/anijain2305/968/head 2025-12-04T08:56:59.0336987Z * [new branch] gh/anijain2305/968/orig -> origin/gh/anijain2305/968/orig 2025-12-04T08:56:59.0339945Z * [new branch] gh/anijain2305/969/base -> origin/gh/anijain2305/969/base 2025-12-04T08:56:59.0342216Z * [new branch] gh/anijain2305/969/head -> origin/gh/anijain2305/969/head 2025-12-04T08:56:59.0344399Z * [new branch] gh/anijain2305/969/orig -> origin/gh/anijain2305/969/orig 2025-12-04T08:56:59.0347798Z * [new branch] gh/anijain2305/970/base -> origin/gh/anijain2305/970/base 2025-12-04T08:56:59.0349863Z * [new branch] gh/anijain2305/970/head -> origin/gh/anijain2305/970/head 2025-12-04T08:56:59.0352150Z * [new branch] gh/anijain2305/970/orig -> origin/gh/anijain2305/970/orig 2025-12-04T08:56:59.0355984Z * [new branch] gh/anjali411/216/base -> origin/gh/anjali411/216/base 2025-12-04T08:56:59.0357416Z * [new branch] gh/anjali411/216/head -> origin/gh/anjali411/216/head 2025-12-04T08:56:59.0359052Z * [new branch] gh/anjali411/216/orig -> origin/gh/anjali411/216/orig 2025-12-04T08:56:59.0361760Z * [new branch] gh/anshul-si/1/base -> origin/gh/anshul-si/1/base 2025-12-04T08:56:59.0363674Z * [new branch] gh/anshul-si/1/head -> origin/gh/anshul-si/1/head 2025-12-04T08:56:59.0365642Z * [new branch] gh/anshul-si/2/base -> origin/gh/anshul-si/2/base 2025-12-04T08:56:59.0367207Z * [new branch] gh/anshul-si/2/head -> origin/gh/anshul-si/2/head 2025-12-04T08:56:59.0369266Z * [new branch] gh/anshul-si/3/base -> origin/gh/anshul-si/3/base 2025-12-04T08:56:59.0370765Z * [new branch] gh/anshul-si/3/head -> origin/gh/anshul-si/3/head 2025-12-04T08:56:59.0372869Z * [new branch] gh/anshul-si/4/base -> origin/gh/anshul-si/4/base 2025-12-04T08:56:59.0374392Z * [new branch] gh/anshul-si/4/head -> origin/gh/anshul-si/4/head 2025-12-04T08:56:59.0376304Z * [new branch] gh/anshul-si/5/base -> origin/gh/anshul-si/5/base 2025-12-04T08:56:59.0377865Z * [new branch] gh/anshul-si/5/head -> origin/gh/anshul-si/5/head 2025-12-04T08:56:59.0380124Z * [new branch] gh/anshul-si/53/base -> origin/gh/anshul-si/53/base 2025-12-04T08:56:59.0381686Z * [new branch] gh/anshul-si/53/head -> origin/gh/anshul-si/53/head 2025-12-04T08:56:59.0383860Z * [new branch] gh/anshul-si/58/base -> origin/gh/anshul-si/58/base 2025-12-04T08:56:59.0385524Z * [new branch] gh/anshul-si/58/head -> origin/gh/anshul-si/58/head 2025-12-04T08:56:59.0387544Z * [new branch] gh/anshul-si/66/base -> origin/gh/anshul-si/66/base 2025-12-04T08:56:59.0389225Z * [new branch] gh/anshul-si/66/head -> origin/gh/anshul-si/66/head 2025-12-04T08:56:59.0390833Z * [new branch] gh/anshul-si/66/orig -> origin/gh/anshul-si/66/orig 2025-12-04T08:56:59.0392840Z * [new branch] gh/anshul-si/67/base -> origin/gh/anshul-si/67/base 2025-12-04T08:56:59.0394411Z * [new branch] gh/anshul-si/67/head -> origin/gh/anshul-si/67/head 2025-12-04T08:56:59.0395951Z * [new branch] gh/anshul-si/67/orig -> origin/gh/anshul-si/67/orig 2025-12-04T08:56:59.0398261Z * [new branch] gh/anshul-si/68/base -> origin/gh/anshul-si/68/base 2025-12-04T08:56:59.0399828Z * [new branch] gh/anshul-si/68/head -> origin/gh/anshul-si/68/head 2025-12-04T08:56:59.0401398Z * [new branch] gh/anshul-si/68/orig -> origin/gh/anshul-si/68/orig 2025-12-04T08:56:59.0403688Z * [new branch] gh/anshul-si/69/base -> origin/gh/anshul-si/69/base 2025-12-04T08:56:59.0405203Z * [new branch] gh/anshul-si/69/head -> origin/gh/anshul-si/69/head 2025-12-04T08:56:59.0406785Z * [new branch] gh/anshul-si/69/orig -> origin/gh/anshul-si/69/orig 2025-12-04T08:56:59.0408908Z * [new branch] gh/anshul-si/70/base -> origin/gh/anshul-si/70/base 2025-12-04T08:56:59.0410594Z * [new branch] gh/anshul-si/70/head -> origin/gh/anshul-si/70/head 2025-12-04T08:56:59.0412191Z * [new branch] gh/anshul-si/70/orig -> origin/gh/anshul-si/70/orig 2025-12-04T08:56:59.0414588Z * [new branch] gh/anshul-si/71/base -> origin/gh/anshul-si/71/base 2025-12-04T08:56:59.0415909Z * [new branch] gh/anshul-si/71/head -> origin/gh/anshul-si/71/head 2025-12-04T08:56:59.0417670Z * [new branch] gh/anshul-si/71/orig -> origin/gh/anshul-si/71/orig 2025-12-04T08:56:59.0419754Z * [new branch] gh/anshul-si/72/base -> origin/gh/anshul-si/72/base 2025-12-04T08:56:59.0421370Z * [new branch] gh/anshul-si/72/head -> origin/gh/anshul-si/72/head 2025-12-04T08:56:59.0422840Z * [new branch] gh/anshul-si/72/orig -> origin/gh/anshul-si/72/orig 2025-12-04T08:56:59.0425034Z * [new branch] gh/anshul-si/73/base -> origin/gh/anshul-si/73/base 2025-12-04T08:56:59.0426657Z * [new branch] gh/anshul-si/73/head -> origin/gh/anshul-si/73/head 2025-12-04T08:56:59.0428252Z * [new branch] gh/anshul-si/73/orig -> origin/gh/anshul-si/73/orig 2025-12-04T08:56:59.0431002Z * [new branch] gh/aorenste/132/base -> origin/gh/aorenste/132/base 2025-12-04T08:56:59.0432584Z * [new branch] gh/aorenste/132/head -> origin/gh/aorenste/132/head 2025-12-04T08:56:59.0434944Z * [new branch] gh/aorenste/134/base -> origin/gh/aorenste/134/base 2025-12-04T08:56:59.0436647Z * [new branch] gh/aorenste/134/head -> origin/gh/aorenste/134/head 2025-12-04T08:56:59.0438197Z * [new branch] gh/aorenste/134/orig -> origin/gh/aorenste/134/orig 2025-12-04T08:56:59.0440388Z * [new branch] gh/aorenste/139/base -> origin/gh/aorenste/139/base 2025-12-04T08:56:59.0441960Z * [new branch] gh/aorenste/139/head -> origin/gh/aorenste/139/head 2025-12-04T08:56:59.0443506Z * [new branch] gh/aorenste/139/orig -> origin/gh/aorenste/139/orig 2025-12-04T08:56:59.0445583Z * [new branch] gh/aorenste/141/base -> origin/gh/aorenste/141/base 2025-12-04T08:56:59.0447139Z * [new branch] gh/aorenste/141/head -> origin/gh/aorenste/141/head 2025-12-04T08:56:59.0449446Z * [new branch] gh/aorenste/145/base -> origin/gh/aorenste/145/base 2025-12-04T08:56:59.0451036Z * [new branch] gh/aorenste/145/head -> origin/gh/aorenste/145/head 2025-12-04T08:56:59.0453034Z * [new branch] gh/aorenste/145/orig -> origin/gh/aorenste/145/orig 2025-12-04T08:56:59.0455215Z * [new branch] gh/aorenste/146/base -> origin/gh/aorenste/146/base 2025-12-04T08:56:59.0456927Z * [new branch] gh/aorenste/146/head -> origin/gh/aorenste/146/head 2025-12-04T08:56:59.0458554Z * [new branch] gh/aorenste/146/orig -> origin/gh/aorenste/146/orig 2025-12-04T08:56:59.0460757Z * [new branch] gh/aorenste/147/base -> origin/gh/aorenste/147/base 2025-12-04T08:56:59.0462555Z * [new branch] gh/aorenste/147/head -> origin/gh/aorenste/147/head 2025-12-04T08:56:59.0464147Z * [new branch] gh/aorenste/147/orig -> origin/gh/aorenste/147/orig 2025-12-04T08:56:59.0466282Z * [new branch] gh/aorenste/148/base -> origin/gh/aorenste/148/base 2025-12-04T08:56:59.0467877Z * [new branch] gh/aorenste/148/head -> origin/gh/aorenste/148/head 2025-12-04T08:56:59.0469531Z * [new branch] gh/aorenste/148/orig -> origin/gh/aorenste/148/orig 2025-12-04T08:56:59.0471626Z * [new branch] gh/aorenste/149/base -> origin/gh/aorenste/149/base 2025-12-04T08:56:59.0473231Z * [new branch] gh/aorenste/149/head -> origin/gh/aorenste/149/head 2025-12-04T08:56:59.0474848Z * [new branch] gh/aorenste/149/orig -> origin/gh/aorenste/149/orig 2025-12-04T08:56:59.0477451Z * [new branch] gh/aorenste/150/base -> origin/gh/aorenste/150/base 2025-12-04T08:56:59.0479169Z * [new branch] gh/aorenste/150/head -> origin/gh/aorenste/150/head 2025-12-04T08:56:59.0480631Z * [new branch] gh/aorenste/150/orig -> origin/gh/aorenste/150/orig 2025-12-04T08:56:59.0482652Z * [new branch] gh/aorenste/151/base -> origin/gh/aorenste/151/base 2025-12-04T08:56:59.0484233Z * [new branch] gh/aorenste/151/head -> origin/gh/aorenste/151/head 2025-12-04T08:56:59.0485897Z * [new branch] gh/aorenste/151/orig -> origin/gh/aorenste/151/orig 2025-12-04T08:56:59.0488055Z * [new branch] gh/aorenste/152/base -> origin/gh/aorenste/152/base 2025-12-04T08:56:59.0489583Z * [new branch] gh/aorenste/152/head -> origin/gh/aorenste/152/head 2025-12-04T08:56:59.0491127Z * [new branch] gh/aorenste/152/orig -> origin/gh/aorenste/152/orig 2025-12-04T08:56:59.0493339Z * [new branch] gh/aorenste/153/base -> origin/gh/aorenste/153/base 2025-12-04T08:56:59.0494860Z * [new branch] gh/aorenste/153/head -> origin/gh/aorenste/153/head 2025-12-04T08:56:59.0496509Z * [new branch] gh/aorenste/153/orig -> origin/gh/aorenste/153/orig 2025-12-04T08:56:59.0498456Z * [new branch] gh/aorenste/154/base -> origin/gh/aorenste/154/base 2025-12-04T08:56:59.0500007Z * [new branch] gh/aorenste/154/head -> origin/gh/aorenste/154/head 2025-12-04T08:56:59.0501668Z * [new branch] gh/aorenste/154/orig -> origin/gh/aorenste/154/orig 2025-12-04T08:56:59.0503479Z * [new branch] gh/aorenste/155/base -> origin/gh/aorenste/155/base 2025-12-04T08:56:59.0505047Z * [new branch] gh/aorenste/155/head -> origin/gh/aorenste/155/head 2025-12-04T08:56:59.0506562Z * [new branch] gh/aorenste/155/orig -> origin/gh/aorenste/155/orig 2025-12-04T08:56:59.0508608Z * [new branch] gh/aorenste/156/base -> origin/gh/aorenste/156/base 2025-12-04T08:56:59.0510135Z * [new branch] gh/aorenste/156/head -> origin/gh/aorenste/156/head 2025-12-04T08:56:59.0511584Z * [new branch] gh/aorenste/156/orig -> origin/gh/aorenste/156/orig 2025-12-04T08:56:59.0514127Z * [new branch] gh/aorenste/157/base -> origin/gh/aorenste/157/base 2025-12-04T08:56:59.0515781Z * [new branch] gh/aorenste/157/head -> origin/gh/aorenste/157/head 2025-12-04T08:56:59.0517347Z * [new branch] gh/aorenste/157/orig -> origin/gh/aorenste/157/orig 2025-12-04T08:56:59.0519485Z * [new branch] gh/aorenste/158/base -> origin/gh/aorenste/158/base 2025-12-04T08:56:59.0521080Z * [new branch] gh/aorenste/158/head -> origin/gh/aorenste/158/head 2025-12-04T08:56:59.0522595Z * [new branch] gh/aorenste/158/orig -> origin/gh/aorenste/158/orig 2025-12-04T08:56:59.0524611Z * [new branch] gh/aorenste/159/base -> origin/gh/aorenste/159/base 2025-12-04T08:56:59.0526215Z * [new branch] gh/aorenste/159/head -> origin/gh/aorenste/159/head 2025-12-04T08:56:59.0527698Z * [new branch] gh/aorenste/159/orig -> origin/gh/aorenste/159/orig 2025-12-04T08:56:59.0530302Z * [new branch] gh/avikchaudhuri/1/base -> origin/gh/avikchaudhuri/1/base 2025-12-04T08:56:59.0531881Z * [new branch] gh/avikchaudhuri/1/head -> origin/gh/avikchaudhuri/1/head 2025-12-04T08:56:59.0533997Z * [new branch] gh/avikchaudhuri/2/base -> origin/gh/avikchaudhuri/2/base 2025-12-04T08:56:59.0535522Z * [new branch] gh/avikchaudhuri/2/head -> origin/gh/avikchaudhuri/2/head 2025-12-04T08:56:59.0537087Z * [new branch] gh/avikchaudhuri/2/orig -> origin/gh/avikchaudhuri/2/orig 2025-12-04T08:56:59.0539951Z * [new branch] gh/bdhirsh/666/base -> origin/gh/bdhirsh/666/base 2025-12-04T08:56:59.0541668Z * [new branch] gh/bdhirsh/666/head -> origin/gh/bdhirsh/666/head 2025-12-04T08:56:59.0543127Z * [new branch] gh/bdhirsh/666/orig -> origin/gh/bdhirsh/666/orig 2025-12-04T08:56:59.0545230Z * [new branch] gh/bdhirsh/668/base -> origin/gh/bdhirsh/668/base 2025-12-04T08:56:59.0546787Z * [new branch] gh/bdhirsh/668/head -> origin/gh/bdhirsh/668/head 2025-12-04T08:56:59.0548312Z * [new branch] gh/bdhirsh/668/orig -> origin/gh/bdhirsh/668/orig 2025-12-04T08:56:59.0550522Z * [new branch] gh/bdhirsh/669/base -> origin/gh/bdhirsh/669/base 2025-12-04T08:56:59.0552114Z * [new branch] gh/bdhirsh/669/head -> origin/gh/bdhirsh/669/head 2025-12-04T08:56:59.0553629Z * [new branch] gh/bdhirsh/669/orig -> origin/gh/bdhirsh/669/orig 2025-12-04T08:56:59.0555847Z * [new branch] gh/bdhirsh/670/base -> origin/gh/bdhirsh/670/base 2025-12-04T08:56:59.0557594Z * [new branch] gh/bdhirsh/670/head -> origin/gh/bdhirsh/670/head 2025-12-04T08:56:59.0559159Z * [new branch] gh/bdhirsh/670/orig -> origin/gh/bdhirsh/670/orig 2025-12-04T08:56:59.0561328Z * [new branch] gh/bdhirsh/672/base -> origin/gh/bdhirsh/672/base 2025-12-04T08:56:59.0564527Z * [new branch] gh/bdhirsh/672/head -> origin/gh/bdhirsh/672/head 2025-12-04T08:56:59.0566136Z * [new branch] gh/bdhirsh/672/orig -> origin/gh/bdhirsh/672/orig 2025-12-04T08:56:59.0568410Z * [new branch] gh/bdhirsh/675/base -> origin/gh/bdhirsh/675/base 2025-12-04T08:56:59.0570088Z * [new branch] gh/bdhirsh/675/head -> origin/gh/bdhirsh/675/head 2025-12-04T08:56:59.0571643Z * [new branch] gh/bdhirsh/675/orig -> origin/gh/bdhirsh/675/orig 2025-12-04T08:56:59.0581946Z * [new branch] gh/bdhirsh/676/base -> origin/gh/bdhirsh/676/base 2025-12-04T08:56:59.0582457Z * [new branch] gh/bdhirsh/676/head -> origin/gh/bdhirsh/676/head 2025-12-04T08:56:59.0582953Z * [new branch] gh/bdhirsh/676/orig -> origin/gh/bdhirsh/676/orig 2025-12-04T08:56:59.0583446Z * [new branch] gh/bdhirsh/677/base -> origin/gh/bdhirsh/677/base 2025-12-04T08:56:59.0583635Z * [new branch] gh/bdhirsh/677/head -> origin/gh/bdhirsh/677/head 2025-12-04T08:56:59.0583838Z * [new branch] gh/bdhirsh/677/orig -> origin/gh/bdhirsh/677/orig 2025-12-04T08:56:59.0586052Z * [new branch] gh/bdhirsh/678/base -> origin/gh/bdhirsh/678/base 2025-12-04T08:56:59.0587772Z * [new branch] gh/bdhirsh/678/head -> origin/gh/bdhirsh/678/head 2025-12-04T08:56:59.0589293Z * [new branch] gh/bdhirsh/678/orig -> origin/gh/bdhirsh/678/orig 2025-12-04T08:56:59.0591500Z * [new branch] gh/bdhirsh/679/base -> origin/gh/bdhirsh/679/base 2025-12-04T08:56:59.0593139Z * [new branch] gh/bdhirsh/679/head -> origin/gh/bdhirsh/679/head 2025-12-04T08:56:59.0594657Z * [new branch] gh/bdhirsh/679/orig -> origin/gh/bdhirsh/679/orig 2025-12-04T08:56:59.0596830Z * [new branch] gh/bdhirsh/680/base -> origin/gh/bdhirsh/680/base 2025-12-04T08:56:59.0598471Z * [new branch] gh/bdhirsh/680/head -> origin/gh/bdhirsh/680/head 2025-12-04T08:56:59.0600042Z * [new branch] gh/bdhirsh/680/orig -> origin/gh/bdhirsh/680/orig 2025-12-04T08:56:59.0602036Z * [new branch] gh/bdhirsh/681/base -> origin/gh/bdhirsh/681/base 2025-12-04T08:56:59.0603774Z * [new branch] gh/bdhirsh/681/head -> origin/gh/bdhirsh/681/head 2025-12-04T08:56:59.0605332Z * [new branch] gh/bdhirsh/681/orig -> origin/gh/bdhirsh/681/orig 2025-12-04T08:56:59.0608160Z * [new branch] gh/benjaminglass1/101/base -> origin/gh/benjaminglass1/101/base 2025-12-04T08:56:59.0609621Z * [new branch] gh/benjaminglass1/101/head -> origin/gh/benjaminglass1/101/head 2025-12-04T08:56:59.0611158Z * [new branch] gh/benjaminglass1/101/orig -> origin/gh/benjaminglass1/101/orig 2025-12-04T08:56:59.0613368Z * [new branch] gh/benjaminglass1/102/base -> origin/gh/benjaminglass1/102/base 2025-12-04T08:56:59.0615126Z * [new branch] gh/benjaminglass1/102/head -> origin/gh/benjaminglass1/102/head 2025-12-04T08:56:59.0616693Z * [new branch] gh/benjaminglass1/102/orig -> origin/gh/benjaminglass1/102/orig 2025-12-04T08:56:59.0618796Z * [new branch] gh/benjaminglass1/106/base -> origin/gh/benjaminglass1/106/base 2025-12-04T08:56:59.0620373Z * [new branch] gh/benjaminglass1/106/head -> origin/gh/benjaminglass1/106/head 2025-12-04T08:56:59.0621949Z * [new branch] gh/benjaminglass1/106/orig -> origin/gh/benjaminglass1/106/orig 2025-12-04T08:56:59.0624131Z * [new branch] gh/benjaminglass1/107/base -> origin/gh/benjaminglass1/107/base 2025-12-04T08:56:59.0625662Z * [new branch] gh/benjaminglass1/107/head -> origin/gh/benjaminglass1/107/head 2025-12-04T08:56:59.0627228Z * [new branch] gh/benjaminglass1/107/orig -> origin/gh/benjaminglass1/107/orig 2025-12-04T08:56:59.0629280Z * [new branch] gh/benjaminglass1/108/base -> origin/gh/benjaminglass1/108/base 2025-12-04T08:56:59.0630926Z * [new branch] gh/benjaminglass1/108/head -> origin/gh/benjaminglass1/108/head 2025-12-04T08:56:59.0632597Z * [new branch] gh/benjaminglass1/108/orig -> origin/gh/benjaminglass1/108/orig 2025-12-04T08:56:59.0634632Z * [new branch] gh/benjaminglass1/109/base -> origin/gh/benjaminglass1/109/base 2025-12-04T08:56:59.0636137Z * [new branch] gh/benjaminglass1/109/head -> origin/gh/benjaminglass1/109/head 2025-12-04T08:56:59.0638214Z * [new branch] gh/benjaminglass1/109/orig -> origin/gh/benjaminglass1/109/orig 2025-12-04T08:56:59.0640359Z * [new branch] gh/benjaminglass1/97/base -> origin/gh/benjaminglass1/97/base 2025-12-04T08:56:59.0641896Z * [new branch] gh/benjaminglass1/97/head -> origin/gh/benjaminglass1/97/head 2025-12-04T08:56:59.0643381Z * [new branch] gh/benjaminglass1/97/orig -> origin/gh/benjaminglass1/97/orig 2025-12-04T08:56:59.0645965Z * [new branch] gh/bobrenjc93/570/base -> origin/gh/bobrenjc93/570/base 2025-12-04T08:56:59.0647635Z * [new branch] gh/bobrenjc93/570/head -> origin/gh/bobrenjc93/570/head 2025-12-04T08:56:59.0649163Z * [new branch] gh/bobrenjc93/570/orig -> origin/gh/bobrenjc93/570/orig 2025-12-04T08:56:59.0651160Z * [new branch] gh/bobrenjc93/604/base -> origin/gh/bobrenjc93/604/base 2025-12-04T08:56:59.0652870Z * [new branch] gh/bobrenjc93/604/head -> origin/gh/bobrenjc93/604/head 2025-12-04T08:56:59.0654379Z * [new branch] gh/bobrenjc93/604/orig -> origin/gh/bobrenjc93/604/orig 2025-12-04T08:56:59.0656643Z * [new branch] gh/bobrenjc93/638/base -> origin/gh/bobrenjc93/638/base 2025-12-04T08:56:59.0658170Z * [new branch] gh/bobrenjc93/638/head -> origin/gh/bobrenjc93/638/head 2025-12-04T08:56:59.0659712Z * [new branch] gh/bobrenjc93/638/orig -> origin/gh/bobrenjc93/638/orig 2025-12-04T08:56:59.0661936Z * [new branch] gh/bobrenjc93/653/base -> origin/gh/bobrenjc93/653/base 2025-12-04T08:56:59.0663522Z * [new branch] gh/bobrenjc93/653/head -> origin/gh/bobrenjc93/653/head 2025-12-04T08:56:59.0665004Z * [new branch] gh/bobrenjc93/653/orig -> origin/gh/bobrenjc93/653/orig 2025-12-04T08:56:59.0667199Z * [new branch] gh/bobrenjc93/654/base -> origin/gh/bobrenjc93/654/base 2025-12-04T08:56:59.0668874Z * [new branch] gh/bobrenjc93/654/head -> origin/gh/bobrenjc93/654/head 2025-12-04T08:56:59.0670337Z * [new branch] gh/bobrenjc93/654/orig -> origin/gh/bobrenjc93/654/orig 2025-12-04T08:56:59.0672363Z * [new branch] gh/bobrenjc93/657/base -> origin/gh/bobrenjc93/657/base 2025-12-04T08:56:59.0673926Z * [new branch] gh/bobrenjc93/657/head -> origin/gh/bobrenjc93/657/head 2025-12-04T08:56:59.0675503Z * [new branch] gh/bobrenjc93/657/orig -> origin/gh/bobrenjc93/657/orig 2025-12-04T08:56:59.0677940Z * [new branch] gh/bobrenjc93/672/base -> origin/gh/bobrenjc93/672/base 2025-12-04T08:56:59.0679447Z * [new branch] gh/bobrenjc93/672/head -> origin/gh/bobrenjc93/672/head 2025-12-04T08:56:59.0680968Z * [new branch] gh/bobrenjc93/672/orig -> origin/gh/bobrenjc93/672/orig 2025-12-04T08:56:59.0683039Z * [new branch] gh/bobrenjc93/679/base -> origin/gh/bobrenjc93/679/base 2025-12-04T08:56:59.0684805Z * [new branch] gh/bobrenjc93/679/head -> origin/gh/bobrenjc93/679/head 2025-12-04T08:56:59.0686396Z * [new branch] gh/bobrenjc93/679/orig -> origin/gh/bobrenjc93/679/orig 2025-12-04T08:56:59.0688656Z * [new branch] gh/bobrenjc93/680/base -> origin/gh/bobrenjc93/680/base 2025-12-04T08:56:59.0690290Z * [new branch] gh/bobrenjc93/680/head -> origin/gh/bobrenjc93/680/head 2025-12-04T08:56:59.0691782Z * [new branch] gh/bobrenjc93/680/orig -> origin/gh/bobrenjc93/680/orig 2025-12-04T08:56:59.0693934Z * [new branch] gh/bobrenjc93/681/base -> origin/gh/bobrenjc93/681/base 2025-12-04T08:56:59.0695592Z * [new branch] gh/bobrenjc93/681/head -> origin/gh/bobrenjc93/681/head 2025-12-04T08:56:59.0697018Z * [new branch] gh/bobrenjc93/681/orig -> origin/gh/bobrenjc93/681/orig 2025-12-04T08:56:59.0699014Z * [new branch] gh/bobrenjc93/682/base -> origin/gh/bobrenjc93/682/base 2025-12-04T08:56:59.0700598Z * [new branch] gh/bobrenjc93/682/head -> origin/gh/bobrenjc93/682/head 2025-12-04T08:56:59.0702174Z * [new branch] gh/bobrenjc93/682/orig -> origin/gh/bobrenjc93/682/orig 2025-12-04T08:56:59.0704260Z * [new branch] gh/bobrenjc93/683/base -> origin/gh/bobrenjc93/683/base 2025-12-04T08:56:59.0705835Z * [new branch] gh/bobrenjc93/683/head -> origin/gh/bobrenjc93/683/head 2025-12-04T08:56:59.0707361Z * [new branch] gh/bobrenjc93/683/orig -> origin/gh/bobrenjc93/683/orig 2025-12-04T08:56:59.0709623Z * [new branch] gh/bobrenjc93/684/base -> origin/gh/bobrenjc93/684/base 2025-12-04T08:56:59.0711302Z * [new branch] gh/bobrenjc93/684/head -> origin/gh/bobrenjc93/684/head 2025-12-04T08:56:59.0713291Z * [new branch] gh/bobrenjc93/684/orig -> origin/gh/bobrenjc93/684/orig 2025-12-04T08:56:59.0714973Z * [new branch] gh/bobrenjc93/685/base -> origin/gh/bobrenjc93/685/base 2025-12-04T08:56:59.0716678Z * [new branch] gh/bobrenjc93/685/head -> origin/gh/bobrenjc93/685/head 2025-12-04T08:56:59.0718555Z * [new branch] gh/bobrenjc93/685/orig -> origin/gh/bobrenjc93/685/orig 2025-12-04T08:56:59.0720806Z * [new branch] gh/bobrenjc93/686/base -> origin/gh/bobrenjc93/686/base 2025-12-04T08:56:59.0722381Z * [new branch] gh/bobrenjc93/686/head -> origin/gh/bobrenjc93/686/head 2025-12-04T08:56:59.0724074Z * [new branch] gh/bobrenjc93/686/orig -> origin/gh/bobrenjc93/686/orig 2025-12-04T08:56:59.0725960Z * [new branch] gh/bobrenjc93/687/base -> origin/gh/bobrenjc93/687/base 2025-12-04T08:56:59.0727703Z * [new branch] gh/bobrenjc93/687/head -> origin/gh/bobrenjc93/687/head 2025-12-04T08:56:59.0731302Z * [new branch] gh/bobrenjc93/687/orig -> origin/gh/bobrenjc93/687/orig 2025-12-04T08:56:59.0733035Z * [new branch] gh/bobrenjc93/688/base -> origin/gh/bobrenjc93/688/base 2025-12-04T08:56:59.0734238Z * [new branch] gh/bobrenjc93/688/head -> origin/gh/bobrenjc93/688/head 2025-12-04T08:56:59.0735958Z * [new branch] gh/bobrenjc93/688/orig -> origin/gh/bobrenjc93/688/orig 2025-12-04T08:56:59.0737960Z * [new branch] gh/bobrenjc93/689/base -> origin/gh/bobrenjc93/689/base 2025-12-04T08:56:59.0739621Z * [new branch] gh/bobrenjc93/689/head -> origin/gh/bobrenjc93/689/head 2025-12-04T08:56:59.0741306Z * [new branch] gh/bobrenjc93/689/orig -> origin/gh/bobrenjc93/689/orig 2025-12-04T08:56:59.0743305Z * [new branch] gh/bobrenjc93/690/base -> origin/gh/bobrenjc93/690/base 2025-12-04T08:56:59.0744894Z * [new branch] gh/bobrenjc93/690/head -> origin/gh/bobrenjc93/690/head 2025-12-04T08:56:59.0746513Z * [new branch] gh/bobrenjc93/690/orig -> origin/gh/bobrenjc93/690/orig 2025-12-04T08:56:59.0749104Z * [new branch] gh/bobrenjc93/691/base -> origin/gh/bobrenjc93/691/base 2025-12-04T08:56:59.0750880Z * [new branch] gh/bobrenjc93/691/head -> origin/gh/bobrenjc93/691/head 2025-12-04T08:56:59.0752793Z * [new branch] gh/bobrenjc93/691/orig -> origin/gh/bobrenjc93/691/orig 2025-12-04T08:56:59.0755505Z * [new branch] gh/bobrenjc93/692/base -> origin/gh/bobrenjc93/692/base 2025-12-04T08:56:59.0757191Z * [new branch] gh/bobrenjc93/692/head -> origin/gh/bobrenjc93/692/head 2025-12-04T08:56:59.0758750Z * [new branch] gh/bobrenjc93/692/orig -> origin/gh/bobrenjc93/692/orig 2025-12-04T08:56:59.0760716Z * [new branch] gh/bobrenjc93/693/base -> origin/gh/bobrenjc93/693/base 2025-12-04T08:56:59.0762403Z * [new branch] gh/bobrenjc93/693/head -> origin/gh/bobrenjc93/693/head 2025-12-04T08:56:59.0766918Z * [new branch] gh/bobrenjc93/693/orig -> origin/gh/bobrenjc93/693/orig 2025-12-04T08:56:59.0769146Z * [new branch] gh/bobrenjc93/694/base -> origin/gh/bobrenjc93/694/base 2025-12-04T08:56:59.0770735Z * [new branch] gh/bobrenjc93/694/head -> origin/gh/bobrenjc93/694/head 2025-12-04T08:56:59.0772424Z * [new branch] gh/bobrenjc93/694/orig -> origin/gh/bobrenjc93/694/orig 2025-12-04T08:56:59.0775127Z * [new branch] gh/bobrenjc93/695/base -> origin/gh/bobrenjc93/695/base 2025-12-04T08:56:59.0776822Z * [new branch] gh/bobrenjc93/695/head -> origin/gh/bobrenjc93/695/head 2025-12-04T08:56:59.0778391Z * [new branch] gh/bobrenjc93/695/orig -> origin/gh/bobrenjc93/695/orig 2025-12-04T08:56:59.0781077Z * [new branch] gh/c00w/23/base -> origin/gh/c00w/23/base 2025-12-04T08:56:59.0782783Z * [new branch] gh/c00w/23/head -> origin/gh/c00w/23/head 2025-12-04T08:56:59.0784880Z * [new branch] gh/c00w/53/base -> origin/gh/c00w/53/base 2025-12-04T08:56:59.0786426Z * [new branch] gh/c00w/53/head -> origin/gh/c00w/53/head 2025-12-04T08:56:59.0787950Z * [new branch] gh/c00w/53/orig -> origin/gh/c00w/53/orig 2025-12-04T08:56:59.0789951Z * [new branch] gh/c00w/54/base -> origin/gh/c00w/54/base 2025-12-04T08:56:59.0791532Z * [new branch] gh/c00w/54/head -> origin/gh/c00w/54/head 2025-12-04T08:56:59.0793078Z * [new branch] gh/c00w/54/orig -> origin/gh/c00w/54/orig 2025-12-04T08:56:59.0795235Z * [new branch] gh/c00w/56/base -> origin/gh/c00w/56/base 2025-12-04T08:56:59.0796756Z * [new branch] gh/c00w/56/head -> origin/gh/c00w/56/head 2025-12-04T08:56:59.0798335Z * [new branch] gh/c00w/56/orig -> origin/gh/c00w/56/orig 2025-12-04T08:56:59.0800623Z * [new branch] gh/c00w/57/base -> origin/gh/c00w/57/base 2025-12-04T08:56:59.0802096Z * [new branch] gh/c00w/57/head -> origin/gh/c00w/57/head 2025-12-04T08:56:59.0803930Z * [new branch] gh/c00w/57/orig -> origin/gh/c00w/57/orig 2025-12-04T08:56:59.0805957Z * [new branch] gh/c00w/58/base -> origin/gh/c00w/58/base 2025-12-04T08:56:59.0807545Z * [new branch] gh/c00w/58/head -> origin/gh/c00w/58/head 2025-12-04T08:56:59.0809080Z * [new branch] gh/c00w/58/orig -> origin/gh/c00w/58/orig 2025-12-04T08:56:59.0811690Z * [new branch] gh/clee2000/1/base -> origin/gh/clee2000/1/base 2025-12-04T08:56:59.0813579Z * [new branch] gh/clee2000/1/head -> origin/gh/clee2000/1/head 2025-12-04T08:56:59.0815054Z * [new branch] gh/clee2000/1/orig -> origin/gh/clee2000/1/orig 2025-12-04T08:56:59.0817709Z * [new branch] gh/coconutruben/1/base -> origin/gh/coconutruben/1/base 2025-12-04T08:56:59.0819406Z * [new branch] gh/coconutruben/1/head -> origin/gh/coconutruben/1/head 2025-12-04T08:56:59.0821725Z * [new branch] gh/coconutruben/55/base -> origin/gh/coconutruben/55/base 2025-12-04T08:56:59.0823306Z * [new branch] gh/coconutruben/55/head -> origin/gh/coconutruben/55/head 2025-12-04T08:56:59.0824922Z * [new branch] gh/coconutruben/55/orig -> origin/gh/coconutruben/55/orig 2025-12-04T08:56:59.0827144Z * [new branch] gh/coconutruben/57/base -> origin/gh/coconutruben/57/base 2025-12-04T08:56:59.0828817Z * [new branch] gh/coconutruben/57/head -> origin/gh/coconutruben/57/head 2025-12-04T08:56:59.0830463Z * [new branch] gh/coconutruben/57/orig -> origin/gh/coconutruben/57/orig 2025-12-04T08:56:59.0832576Z * [new branch] gh/coconutruben/70/base -> origin/gh/coconutruben/70/base 2025-12-04T08:56:59.0834104Z * [new branch] gh/coconutruben/70/head -> origin/gh/coconutruben/70/head 2025-12-04T08:56:59.0835692Z * [new branch] gh/coconutruben/70/orig -> origin/gh/coconutruben/70/orig 2025-12-04T08:56:59.0837695Z * [new branch] gh/coconutruben/71/base -> origin/gh/coconutruben/71/base 2025-12-04T08:56:59.0839313Z * [new branch] gh/coconutruben/71/head -> origin/gh/coconutruben/71/head 2025-12-04T08:56:59.0840952Z * [new branch] gh/coconutruben/71/orig -> origin/gh/coconutruben/71/orig 2025-12-04T08:56:59.0842941Z * [new branch] gh/coconutruben/72/base -> origin/gh/coconutruben/72/base 2025-12-04T08:56:59.0844523Z * [new branch] gh/coconutruben/72/head -> origin/gh/coconutruben/72/head 2025-12-04T08:56:59.0846059Z * [new branch] gh/coconutruben/72/orig -> origin/gh/coconutruben/72/orig 2025-12-04T08:56:59.0848018Z * [new branch] gh/coconutruben/73/base -> origin/gh/coconutruben/73/base 2025-12-04T08:56:59.0849626Z * [new branch] gh/coconutruben/73/head -> origin/gh/coconutruben/73/head 2025-12-04T08:56:59.0851336Z * [new branch] gh/coconutruben/73/orig -> origin/gh/coconutruben/73/orig 2025-12-04T08:56:59.0853574Z * [new branch] gh/coconutruben/74/base -> origin/gh/coconutruben/74/base 2025-12-04T08:56:59.0855189Z * [new branch] gh/coconutruben/74/head -> origin/gh/coconutruben/74/head 2025-12-04T08:56:59.0856775Z * [new branch] gh/coconutruben/74/orig -> origin/gh/coconutruben/74/orig 2025-12-04T08:56:59.0858889Z * [new branch] gh/coconutruben/79/base -> origin/gh/coconutruben/79/base 2025-12-04T08:56:59.0860647Z * [new branch] gh/coconutruben/79/head -> origin/gh/coconutruben/79/head 2025-12-04T08:56:59.0862650Z * [new branch] gh/coconutruben/79/orig -> origin/gh/coconutruben/79/orig 2025-12-04T08:56:59.0864710Z * [new branch] gh/coconutruben/80/base -> origin/gh/coconutruben/80/base 2025-12-04T08:56:59.0866294Z * [new branch] gh/coconutruben/80/head -> origin/gh/coconutruben/80/head 2025-12-04T08:56:59.0867903Z * [new branch] gh/coconutruben/80/orig -> origin/gh/coconutruben/80/orig 2025-12-04T08:56:59.0870302Z * [new branch] gh/coconutruben/82/base -> origin/gh/coconutruben/82/base 2025-12-04T08:56:59.0871835Z * [new branch] gh/coconutruben/82/head -> origin/gh/coconutruben/82/head 2025-12-04T08:56:59.0873430Z * [new branch] gh/coconutruben/82/orig -> origin/gh/coconutruben/82/orig 2025-12-04T08:56:59.0875490Z * [new branch] gh/coconutruben/83/base -> origin/gh/coconutruben/83/base 2025-12-04T08:56:59.0877144Z * [new branch] gh/coconutruben/83/head -> origin/gh/coconutruben/83/head 2025-12-04T08:56:59.0878645Z * [new branch] gh/coconutruben/83/orig -> origin/gh/coconutruben/83/orig 2025-12-04T08:56:59.0880752Z * [new branch] gh/coconutruben/84/base -> origin/gh/coconutruben/84/base 2025-12-04T08:56:59.0882491Z * [new branch] gh/coconutruben/84/head -> origin/gh/coconutruben/84/head 2025-12-04T08:56:59.0884166Z * [new branch] gh/coconutruben/84/orig -> origin/gh/coconutruben/84/orig 2025-12-04T08:56:59.0886205Z * [new branch] gh/coconutruben/85/base -> origin/gh/coconutruben/85/base 2025-12-04T08:56:59.0887960Z * [new branch] gh/coconutruben/85/head -> origin/gh/coconutruben/85/head 2025-12-04T08:56:59.0889424Z * [new branch] gh/coconutruben/85/orig -> origin/gh/coconutruben/85/orig 2025-12-04T08:56:59.0891568Z * [new branch] gh/coconutruben/86/base -> origin/gh/coconutruben/86/base 2025-12-04T08:56:59.0893258Z * [new branch] gh/coconutruben/86/head -> origin/gh/coconutruben/86/head 2025-12-04T08:56:59.0894770Z * [new branch] gh/coconutruben/86/orig -> origin/gh/coconutruben/86/orig 2025-12-04T08:56:59.0897285Z * [new branch] gh/colinchan15/1/base -> origin/gh/colinchan15/1/base 2025-12-04T08:56:59.0898921Z * [new branch] gh/colinchan15/1/head -> origin/gh/colinchan15/1/head 2025-12-04T08:56:59.0900791Z * [new branch] gh/colinchan15/2/base -> origin/gh/colinchan15/2/base 2025-12-04T08:56:59.0902311Z * [new branch] gh/colinchan15/2/head -> origin/gh/colinchan15/2/head 2025-12-04T08:56:59.0904432Z * [new branch] gh/colinchan15/3/base -> origin/gh/colinchan15/3/base 2025-12-04T08:56:59.0905979Z * [new branch] gh/colinchan15/3/head -> origin/gh/colinchan15/3/head 2025-12-04T08:56:59.0907967Z * [new branch] gh/colinchan15/6/base -> origin/gh/colinchan15/6/base 2025-12-04T08:56:59.0909529Z * [new branch] gh/colinchan15/6/head -> origin/gh/colinchan15/6/head 2025-12-04T08:56:59.0911956Z * [new branch] gh/d4l3k/1/base -> origin/gh/d4l3k/1/base 2025-12-04T08:56:59.0913605Z * [new branch] gh/d4l3k/1/head -> origin/gh/d4l3k/1/head 2025-12-04T08:56:59.0915541Z * [new branch] gh/d4l3k/2/base -> origin/gh/d4l3k/2/base 2025-12-04T08:56:59.0917202Z * [new branch] gh/d4l3k/2/head -> origin/gh/d4l3k/2/head 2025-12-04T08:56:59.0918867Z * [new branch] gh/d4l3k/2/orig -> origin/gh/d4l3k/2/orig 2025-12-04T08:56:59.0920834Z * [new branch] gh/d4l3k/3/base -> origin/gh/d4l3k/3/base 2025-12-04T08:56:59.0922512Z * [new branch] gh/d4l3k/3/head -> origin/gh/d4l3k/3/head 2025-12-04T08:56:59.0924066Z * [new branch] gh/d4l3k/3/orig -> origin/gh/d4l3k/3/orig 2025-12-04T08:56:59.0926137Z * [new branch] gh/d4l3k/4/base -> origin/gh/d4l3k/4/base 2025-12-04T08:56:59.0927694Z * [new branch] gh/d4l3k/4/head -> origin/gh/d4l3k/4/head 2025-12-04T08:56:59.0929241Z * [new branch] gh/d4l3k/4/orig -> origin/gh/d4l3k/4/orig 2025-12-04T08:56:59.0931125Z * [new branch] gh/d4l3k/5/base -> origin/gh/d4l3k/5/base 2025-12-04T08:56:59.0932621Z * [new branch] gh/d4l3k/5/orig -> origin/gh/d4l3k/5/orig 2025-12-04T08:56:59.0935253Z * [new branch] gh/davidberard98/392/base -> origin/gh/davidberard98/392/base 2025-12-04T08:56:59.0936759Z * [new branch] gh/davidberard98/392/head -> origin/gh/davidberard98/392/head 2025-12-04T08:56:59.0938304Z * [new branch] gh/davidberard98/392/orig -> origin/gh/davidberard98/392/orig 2025-12-04T08:56:59.0940475Z * [new branch] gh/davidberard98/399/base -> origin/gh/davidberard98/399/base 2025-12-04T08:56:59.0942162Z * [new branch] gh/davidberard98/399/head -> origin/gh/davidberard98/399/head 2025-12-04T08:56:59.0943704Z * [new branch] gh/davidberard98/399/orig -> origin/gh/davidberard98/399/orig 2025-12-04T08:56:59.0946198Z * [new branch] gh/desertfire/605/base -> origin/gh/desertfire/605/base 2025-12-04T08:56:59.0947767Z * [new branch] gh/desertfire/605/head -> origin/gh/desertfire/605/head 2025-12-04T08:56:59.0949357Z * [new branch] gh/desertfire/605/orig -> origin/gh/desertfire/605/orig 2025-12-04T08:56:59.0951455Z * [new branch] gh/desertfire/606/base -> origin/gh/desertfire/606/base 2025-12-04T08:56:59.0953035Z * [new branch] gh/desertfire/606/head -> origin/gh/desertfire/606/head 2025-12-04T08:56:59.0954738Z * [new branch] gh/desertfire/606/orig -> origin/gh/desertfire/606/orig 2025-12-04T08:56:59.0956771Z * [new branch] gh/desertfire/607/base -> origin/gh/desertfire/607/base 2025-12-04T08:56:59.0958303Z * [new branch] gh/desertfire/607/head -> origin/gh/desertfire/607/head 2025-12-04T08:56:59.0959911Z * [new branch] gh/desertfire/607/orig -> origin/gh/desertfire/607/orig 2025-12-04T08:56:59.0962170Z * [new branch] gh/desertfire/608/base -> origin/gh/desertfire/608/base 2025-12-04T08:56:59.0963788Z * [new branch] gh/desertfire/608/head -> origin/gh/desertfire/608/head 2025-12-04T08:56:59.0965342Z * [new branch] gh/desertfire/608/orig -> origin/gh/desertfire/608/orig 2025-12-04T08:56:59.0967434Z * [new branch] gh/desertfire/609/base -> origin/gh/desertfire/609/base 2025-12-04T08:56:59.0968962Z * [new branch] gh/desertfire/609/head -> origin/gh/desertfire/609/head 2025-12-04T08:56:59.0970543Z * [new branch] gh/desertfire/609/orig -> origin/gh/desertfire/609/orig 2025-12-04T08:56:59.0973064Z * [new branch] gh/desertfire/610/base -> origin/gh/desertfire/610/base 2025-12-04T08:56:59.0974622Z * [new branch] gh/desertfire/610/head -> origin/gh/desertfire/610/head 2025-12-04T08:56:59.0976194Z * [new branch] gh/desertfire/610/orig -> origin/gh/desertfire/610/orig 2025-12-04T08:56:59.0978197Z * [new branch] gh/desertfire/611/base -> origin/gh/desertfire/611/base 2025-12-04T08:56:59.0979816Z * [new branch] gh/desertfire/611/head -> origin/gh/desertfire/611/head 2025-12-04T08:56:59.0981408Z * [new branch] gh/desertfire/611/orig -> origin/gh/desertfire/611/orig 2025-12-04T08:56:59.0983564Z * [new branch] gh/desertfire/612/base -> origin/gh/desertfire/612/base 2025-12-04T08:56:59.0985193Z * [new branch] gh/desertfire/612/head -> origin/gh/desertfire/612/head 2025-12-04T08:56:59.0986711Z * [new branch] gh/desertfire/612/orig -> origin/gh/desertfire/612/orig 2025-12-04T08:56:59.0989022Z * [new branch] gh/desertfire/613/base -> origin/gh/desertfire/613/base 2025-12-04T08:56:59.0990692Z * [new branch] gh/desertfire/613/head -> origin/gh/desertfire/613/head 2025-12-04T08:56:59.0992206Z * [new branch] gh/desertfire/613/orig -> origin/gh/desertfire/613/orig 2025-12-04T08:56:59.0994451Z * [new branch] gh/desertfire/614/base -> origin/gh/desertfire/614/base 2025-12-04T08:56:59.0996111Z * [new branch] gh/desertfire/614/head -> origin/gh/desertfire/614/head 2025-12-04T08:56:59.0997687Z * [new branch] gh/desertfire/614/orig -> origin/gh/desertfire/614/orig 2025-12-04T08:56:59.0999792Z * [new branch] gh/desertfire/615/base -> origin/gh/desertfire/615/base 2025-12-04T08:56:59.1001573Z * [new branch] gh/desertfire/615/head -> origin/gh/desertfire/615/head 2025-12-04T08:56:59.1003116Z * [new branch] gh/desertfire/615/orig -> origin/gh/desertfire/615/orig 2025-12-04T08:56:59.1005152Z * [new branch] gh/desertfire/616/base -> origin/gh/desertfire/616/base 2025-12-04T08:56:59.1006781Z * [new branch] gh/desertfire/616/head -> origin/gh/desertfire/616/head 2025-12-04T08:56:59.1008274Z * [new branch] gh/desertfire/616/orig -> origin/gh/desertfire/616/orig 2025-12-04T08:56:59.1010315Z * [new branch] gh/desertfire/617/base -> origin/gh/desertfire/617/base 2025-12-04T08:56:59.1011961Z * [new branch] gh/desertfire/617/head -> origin/gh/desertfire/617/head 2025-12-04T08:56:59.1013620Z * [new branch] gh/desertfire/617/orig -> origin/gh/desertfire/617/orig 2025-12-04T08:56:59.1016164Z * [new branch] gh/dharakk/1/base -> origin/gh/dharakk/1/base 2025-12-04T08:56:59.1017765Z * [new branch] gh/dharakk/1/head -> origin/gh/dharakk/1/head 2025-12-04T08:56:59.1020288Z * [new branch] gh/drisspg/170/base -> origin/gh/drisspg/170/base 2025-12-04T08:56:59.1021825Z * [new branch] gh/drisspg/170/head -> origin/gh/drisspg/170/head 2025-12-04T08:56:59.1023391Z * [new branch] gh/drisspg/170/orig -> origin/gh/drisspg/170/orig 2025-12-04T08:56:59.1025523Z * [new branch] gh/drisspg/182/base -> origin/gh/drisspg/182/base 2025-12-04T08:56:59.1027104Z * [new branch] gh/drisspg/182/head -> origin/gh/drisspg/182/head 2025-12-04T08:56:59.1029147Z * [new branch] gh/drisspg/183/base -> origin/gh/drisspg/183/base 2025-12-04T08:56:59.1030640Z * [new branch] gh/drisspg/183/head -> origin/gh/drisspg/183/head 2025-12-04T08:56:59.1032591Z * [new branch] gh/drisspg/184/base -> origin/gh/drisspg/184/base 2025-12-04T08:56:59.1034046Z * [new branch] gh/drisspg/184/head -> origin/gh/drisspg/184/head 2025-12-04T08:56:59.1036147Z * [new branch] gh/drisspg/185/base -> origin/gh/drisspg/185/base 2025-12-04T08:56:59.1037708Z * [new branch] gh/drisspg/185/head -> origin/gh/drisspg/185/head 2025-12-04T08:56:59.1039729Z * [new branch] gh/drisspg/194/base -> origin/gh/drisspg/194/base 2025-12-04T08:56:59.1041385Z * [new branch] gh/drisspg/194/head -> origin/gh/drisspg/194/head 2025-12-04T08:56:59.1042875Z * [new branch] gh/drisspg/194/orig -> origin/gh/drisspg/194/orig 2025-12-04T08:56:59.1045036Z * [new branch] gh/drisspg/200/base -> origin/gh/drisspg/200/base 2025-12-04T08:56:59.1046582Z * [new branch] gh/drisspg/200/head -> origin/gh/drisspg/200/head 2025-12-04T08:56:59.1048094Z * [new branch] gh/drisspg/200/orig -> origin/gh/drisspg/200/orig 2025-12-04T08:56:59.1050175Z * [new branch] gh/drisspg/218/base -> origin/gh/drisspg/218/base 2025-12-04T08:56:59.1051799Z * [new branch] gh/drisspg/218/head -> origin/gh/drisspg/218/head 2025-12-04T08:56:59.1053380Z * [new branch] gh/drisspg/218/orig -> origin/gh/drisspg/218/orig 2025-12-04T08:56:59.1055481Z * [new branch] gh/drisspg/219/base -> origin/gh/drisspg/219/base 2025-12-04T08:56:59.1057275Z * [new branch] gh/drisspg/219/head -> origin/gh/drisspg/219/head 2025-12-04T08:56:59.1071820Z * [new branch] gh/drisspg/219/orig -> origin/gh/drisspg/219/orig 2025-12-04T08:56:59.1072092Z * [new branch] gh/drisspg/220/base -> origin/gh/drisspg/220/base 2025-12-04T08:56:59.1072286Z * [new branch] gh/drisspg/220/head -> origin/gh/drisspg/220/head 2025-12-04T08:56:59.1072441Z * [new branch] gh/drisspg/220/orig -> origin/gh/drisspg/220/orig 2025-12-04T08:56:59.1072594Z * [new branch] gh/drisspg/221/base -> origin/gh/drisspg/221/base 2025-12-04T08:56:59.1072750Z * [new branch] gh/drisspg/221/head -> origin/gh/drisspg/221/head 2025-12-04T08:56:59.1072894Z * [new branch] gh/drisspg/221/orig -> origin/gh/drisspg/221/orig 2025-12-04T08:56:59.1073042Z * [new branch] gh/drisspg/222/base -> origin/gh/drisspg/222/base 2025-12-04T08:56:59.1073216Z * [new branch] gh/drisspg/222/head -> origin/gh/drisspg/222/head 2025-12-04T08:56:59.1075217Z * [new branch] gh/drisspg/222/orig -> origin/gh/drisspg/222/orig 2025-12-04T08:56:59.1077277Z * [new branch] gh/drisspg/223/base -> origin/gh/drisspg/223/base 2025-12-04T08:56:59.1078629Z * [new branch] gh/drisspg/223/head -> origin/gh/drisspg/223/head 2025-12-04T08:56:59.1080233Z * [new branch] gh/drisspg/223/orig -> origin/gh/drisspg/223/orig 2025-12-04T08:56:59.1082374Z * [new branch] gh/drisspg/224/base -> origin/gh/drisspg/224/base 2025-12-04T08:56:59.1083792Z * [new branch] gh/drisspg/224/head -> origin/gh/drisspg/224/head 2025-12-04T08:56:59.1085389Z * [new branch] gh/drisspg/224/orig -> origin/gh/drisspg/224/orig 2025-12-04T08:56:59.1087499Z * [new branch] gh/drisspg/225/base -> origin/gh/drisspg/225/base 2025-12-04T08:56:59.1089126Z * [new branch] gh/drisspg/225/head -> origin/gh/drisspg/225/head 2025-12-04T08:56:59.1090738Z * [new branch] gh/drisspg/225/orig -> origin/gh/drisspg/225/orig 2025-12-04T08:56:59.1092808Z * [new branch] gh/drisspg/226/base -> origin/gh/drisspg/226/base 2025-12-04T08:56:59.1094461Z * [new branch] gh/drisspg/226/head -> origin/gh/drisspg/226/head 2025-12-04T08:56:59.1096057Z * [new branch] gh/drisspg/226/orig -> origin/gh/drisspg/226/orig 2025-12-04T08:56:59.1098545Z * [new branch] gh/drisspg/227/base -> origin/gh/drisspg/227/base 2025-12-04T08:56:59.1100135Z * [new branch] gh/drisspg/227/head -> origin/gh/drisspg/227/head 2025-12-04T08:56:59.1101636Z * [new branch] gh/drisspg/227/orig -> origin/gh/drisspg/227/orig 2025-12-04T08:56:59.1103799Z * [new branch] gh/drisspg/228/base -> origin/gh/drisspg/228/base 2025-12-04T08:56:59.1105398Z * [new branch] gh/drisspg/228/head -> origin/gh/drisspg/228/head 2025-12-04T08:56:59.1106914Z * [new branch] gh/drisspg/228/orig -> origin/gh/drisspg/228/orig 2025-12-04T08:56:59.1109079Z * [new branch] gh/drisspg/229/base -> origin/gh/drisspg/229/base 2025-12-04T08:56:59.1110727Z * [new branch] gh/drisspg/229/head -> origin/gh/drisspg/229/head 2025-12-04T08:56:59.1112284Z * [new branch] gh/drisspg/229/orig -> origin/gh/drisspg/229/orig 2025-12-04T08:56:59.1114562Z * [new branch] gh/drisspg/230/base -> origin/gh/drisspg/230/base 2025-12-04T08:56:59.1115990Z * [new branch] gh/drisspg/230/head -> origin/gh/drisspg/230/head 2025-12-04T08:56:59.1117530Z * [new branch] gh/drisspg/230/orig -> origin/gh/drisspg/230/orig 2025-12-04T08:56:59.1120042Z * [new branch] gh/dsjohns2/1/base -> origin/gh/dsjohns2/1/base 2025-12-04T08:56:59.1121711Z * [new branch] gh/dsjohns2/1/head -> origin/gh/dsjohns2/1/head 2025-12-04T08:56:59.1124182Z * [new branch] gh/dzmitry-huba/1/base -> origin/gh/dzmitry-huba/1/base 2025-12-04T08:56:59.1125874Z * [new branch] gh/dzmitry-huba/1/head -> origin/gh/dzmitry-huba/1/head 2025-12-04T08:56:59.1128115Z * [new branch] gh/dzmitry-huba/12/base -> origin/gh/dzmitry-huba/12/base 2025-12-04T08:56:59.1129784Z * [new branch] gh/dzmitry-huba/12/head -> origin/gh/dzmitry-huba/12/head 2025-12-04T08:56:59.1131484Z * [new branch] gh/dzmitry-huba/12/orig -> origin/gh/dzmitry-huba/12/orig 2025-12-04T08:56:59.1133723Z * [new branch] gh/dzmitry-huba/13/base -> origin/gh/dzmitry-huba/13/base 2025-12-04T08:56:59.1135389Z * [new branch] gh/dzmitry-huba/13/head -> origin/gh/dzmitry-huba/13/head 2025-12-04T08:56:59.1136976Z * [new branch] gh/dzmitry-huba/13/orig -> origin/gh/dzmitry-huba/13/orig 2025-12-04T08:56:59.1138871Z * [new branch] gh/dzmitry-huba/14/base -> origin/gh/dzmitry-huba/14/base 2025-12-04T08:56:59.1140559Z * [new branch] gh/dzmitry-huba/14/head -> origin/gh/dzmitry-huba/14/head 2025-12-04T08:56:59.1142158Z * [new branch] gh/dzmitry-huba/14/orig -> origin/gh/dzmitry-huba/14/orig 2025-12-04T08:56:59.1144236Z * [new branch] gh/dzmitry-huba/15/base -> origin/gh/dzmitry-huba/15/base 2025-12-04T08:56:59.1145897Z * [new branch] gh/dzmitry-huba/15/head -> origin/gh/dzmitry-huba/15/head 2025-12-04T08:56:59.1147427Z * [new branch] gh/dzmitry-huba/15/orig -> origin/gh/dzmitry-huba/15/orig 2025-12-04T08:56:59.1149604Z * [new branch] gh/dzmitry-huba/16/base -> origin/gh/dzmitry-huba/16/base 2025-12-04T08:56:59.1151269Z * [new branch] gh/dzmitry-huba/16/head -> origin/gh/dzmitry-huba/16/head 2025-12-04T08:56:59.1153088Z * [new branch] gh/dzmitry-huba/16/orig -> origin/gh/dzmitry-huba/16/orig 2025-12-04T08:56:59.1155460Z * [new branch] gh/dzmitry-huba/17/base -> origin/gh/dzmitry-huba/17/base 2025-12-04T08:56:59.1157140Z * [new branch] gh/dzmitry-huba/17/head -> origin/gh/dzmitry-huba/17/head 2025-12-04T08:56:59.1158623Z * [new branch] gh/dzmitry-huba/17/orig -> origin/gh/dzmitry-huba/17/orig 2025-12-04T08:56:59.1160542Z * [new branch] gh/dzmitry-huba/2/base -> origin/gh/dzmitry-huba/2/base 2025-12-04T08:56:59.1162214Z * [new branch] gh/dzmitry-huba/2/head -> origin/gh/dzmitry-huba/2/head 2025-12-04T08:56:59.1166195Z * [new branch] gh/dzmitry-huba/3/base -> origin/gh/dzmitry-huba/3/base 2025-12-04T08:56:59.1167729Z * [new branch] gh/dzmitry-huba/3/head -> origin/gh/dzmitry-huba/3/head 2025-12-04T08:56:59.1170403Z * [new branch] gh/eellison/808/base -> origin/gh/eellison/808/base 2025-12-04T08:56:59.1171995Z * [new branch] gh/eellison/808/head -> origin/gh/eellison/808/head 2025-12-04T08:56:59.1173796Z * [new branch] gh/eellison/808/orig -> origin/gh/eellison/808/orig 2025-12-04T08:56:59.1176026Z * [new branch] gh/eellison/822/base -> origin/gh/eellison/822/base 2025-12-04T08:56:59.1177649Z * [new branch] gh/eellison/822/head -> origin/gh/eellison/822/head 2025-12-04T08:56:59.1180213Z * [new branch] gh/eellison/822/orig -> origin/gh/eellison/822/orig 2025-12-04T08:56:59.1182403Z * [new branch] gh/eellison/823/base -> origin/gh/eellison/823/base 2025-12-04T08:56:59.1183840Z * [new branch] gh/eellison/823/head -> origin/gh/eellison/823/head 2025-12-04T08:56:59.1185414Z * [new branch] gh/eellison/823/orig -> origin/gh/eellison/823/orig 2025-12-04T08:56:59.1187448Z * [new branch] gh/eellison/862/base -> origin/gh/eellison/862/base 2025-12-04T08:56:59.1189004Z * [new branch] gh/eellison/862/head -> origin/gh/eellison/862/head 2025-12-04T08:56:59.1190561Z * [new branch] gh/eellison/862/orig -> origin/gh/eellison/862/orig 2025-12-04T08:56:59.1192622Z * [new branch] gh/eellison/863/base -> origin/gh/eellison/863/base 2025-12-04T08:56:59.1194248Z * [new branch] gh/eellison/863/head -> origin/gh/eellison/863/head 2025-12-04T08:56:59.1195818Z * [new branch] gh/eellison/863/orig -> origin/gh/eellison/863/orig 2025-12-04T08:56:59.1197778Z * [new branch] gh/eellison/864/base -> origin/gh/eellison/864/base 2025-12-04T08:56:59.1199456Z * [new branch] gh/eellison/864/head -> origin/gh/eellison/864/head 2025-12-04T08:56:59.1201125Z * [new branch] gh/eellison/864/orig -> origin/gh/eellison/864/orig 2025-12-04T08:56:59.1203055Z * [new branch] gh/eellison/865/base -> origin/gh/eellison/865/base 2025-12-04T08:56:59.1204691Z * [new branch] gh/eellison/865/head -> origin/gh/eellison/865/head 2025-12-04T08:56:59.1206277Z * [new branch] gh/eellison/865/orig -> origin/gh/eellison/865/orig 2025-12-04T08:56:59.1208255Z * [new branch] gh/eellison/866/base -> origin/gh/eellison/866/base 2025-12-04T08:56:59.1210109Z * [new branch] gh/eellison/866/head -> origin/gh/eellison/866/head 2025-12-04T08:56:59.1211716Z * [new branch] gh/eellison/866/orig -> origin/gh/eellison/866/orig 2025-12-04T08:56:59.1213982Z * [new branch] gh/eellison/867/base -> origin/gh/eellison/867/base 2025-12-04T08:56:59.1215626Z * [new branch] gh/eellison/867/head -> origin/gh/eellison/867/head 2025-12-04T08:56:59.1216934Z * [new branch] gh/eellison/867/orig -> origin/gh/eellison/867/orig 2025-12-04T08:56:59.1219204Z * [new branch] gh/eellison/868/base -> origin/gh/eellison/868/base 2025-12-04T08:56:59.1221046Z * [new branch] gh/eellison/868/head -> origin/gh/eellison/868/head 2025-12-04T08:56:59.1222628Z * [new branch] gh/eellison/868/orig -> origin/gh/eellison/868/orig 2025-12-04T08:56:59.1224631Z * [new branch] gh/eellison/869/base -> origin/gh/eellison/869/base 2025-12-04T08:56:59.1226273Z * [new branch] gh/eellison/869/head -> origin/gh/eellison/869/head 2025-12-04T08:56:59.1227784Z * [new branch] gh/eellison/869/orig -> origin/gh/eellison/869/orig 2025-12-04T08:56:59.1229838Z * [new branch] gh/eellison/870/base -> origin/gh/eellison/870/base 2025-12-04T08:56:59.1231443Z * [new branch] gh/eellison/870/head -> origin/gh/eellison/870/head 2025-12-04T08:56:59.1232928Z * [new branch] gh/eellison/870/orig -> origin/gh/eellison/870/orig 2025-12-04T08:56:59.1235092Z * [new branch] gh/eellison/871/base -> origin/gh/eellison/871/base 2025-12-04T08:56:59.1236682Z * [new branch] gh/eellison/871/head -> origin/gh/eellison/871/head 2025-12-04T08:56:59.1238222Z * [new branch] gh/eellison/871/orig -> origin/gh/eellison/871/orig 2025-12-04T08:56:59.1240901Z * [new branch] gh/eellison/872/base -> origin/gh/eellison/872/base 2025-12-04T08:56:59.1242611Z * [new branch] gh/eellison/872/head -> origin/gh/eellison/872/head 2025-12-04T08:56:59.1244094Z * [new branch] gh/eellison/872/orig -> origin/gh/eellison/872/orig 2025-12-04T08:56:59.1246321Z * [new branch] gh/eellison/873/base -> origin/gh/eellison/873/base 2025-12-04T08:56:59.1247890Z * [new branch] gh/eellison/873/head -> origin/gh/eellison/873/head 2025-12-04T08:56:59.1249438Z * [new branch] gh/eellison/873/orig -> origin/gh/eellison/873/orig 2025-12-04T08:56:59.1251866Z * [new branch] gh/eellison/874/base -> origin/gh/eellison/874/base 2025-12-04T08:56:59.1253060Z * [new branch] gh/eellison/874/head -> origin/gh/eellison/874/head 2025-12-04T08:56:59.1254738Z * [new branch] gh/eellison/874/orig -> origin/gh/eellison/874/orig 2025-12-04T08:56:59.1257564Z * [new branch] gh/eellison/875/base -> origin/gh/eellison/875/base 2025-12-04T08:56:59.1259187Z * [new branch] gh/eellison/875/head -> origin/gh/eellison/875/head 2025-12-04T08:56:59.1260743Z * [new branch] gh/eellison/875/orig -> origin/gh/eellison/875/orig 2025-12-04T08:56:59.1263133Z * [new branch] gh/eellison/876/base -> origin/gh/eellison/876/base 2025-12-04T08:56:59.1264726Z * [new branch] gh/eellison/876/head -> origin/gh/eellison/876/head 2025-12-04T08:56:59.1266310Z * [new branch] gh/eellison/876/orig -> origin/gh/eellison/876/orig 2025-12-04T08:56:59.1268497Z * [new branch] gh/eellison/877/base -> origin/gh/eellison/877/base 2025-12-04T08:56:59.1270090Z * [new branch] gh/eellison/877/head -> origin/gh/eellison/877/head 2025-12-04T08:56:59.1271598Z * [new branch] gh/eellison/877/orig -> origin/gh/eellison/877/orig 2025-12-04T08:56:59.1273722Z * [new branch] gh/eellison/878/base -> origin/gh/eellison/878/base 2025-12-04T08:56:59.1275283Z * [new branch] gh/eellison/878/head -> origin/gh/eellison/878/head 2025-12-04T08:56:59.1276816Z * [new branch] gh/eellison/878/orig -> origin/gh/eellison/878/orig 2025-12-04T08:56:59.1279025Z * [new branch] gh/eellison/879/base -> origin/gh/eellison/879/base 2025-12-04T08:56:59.1280658Z * [new branch] gh/eellison/879/head -> origin/gh/eellison/879/head 2025-12-04T08:56:59.1282260Z * [new branch] gh/eellison/879/orig -> origin/gh/eellison/879/orig 2025-12-04T08:56:59.1284247Z * [new branch] gh/eellison/880/base -> origin/gh/eellison/880/base 2025-12-04T08:56:59.1285844Z * [new branch] gh/eellison/880/head -> origin/gh/eellison/880/head 2025-12-04T08:56:59.1287446Z * [new branch] gh/eellison/880/orig -> origin/gh/eellison/880/orig 2025-12-04T08:56:59.1289648Z * [new branch] gh/eellison/881/base -> origin/gh/eellison/881/base 2025-12-04T08:56:59.1291192Z * [new branch] gh/eellison/881/head -> origin/gh/eellison/881/head 2025-12-04T08:56:59.1292810Z * [new branch] gh/eellison/881/orig -> origin/gh/eellison/881/orig 2025-12-04T08:56:59.1294948Z * [new branch] gh/eellison/882/base -> origin/gh/eellison/882/base 2025-12-04T08:56:59.1296478Z * [new branch] gh/eellison/882/head -> origin/gh/eellison/882/head 2025-12-04T08:56:59.1298131Z * [new branch] gh/eellison/882/orig -> origin/gh/eellison/882/orig 2025-12-04T08:56:59.1300297Z * [new branch] gh/eellison/883/base -> origin/gh/eellison/883/base 2025-12-04T08:56:59.1301859Z * [new branch] gh/eellison/883/head -> origin/gh/eellison/883/head 2025-12-04T08:56:59.1303371Z * [new branch] gh/eellison/883/orig -> origin/gh/eellison/883/orig 2025-12-04T08:56:59.1305528Z * [new branch] gh/eellison/884/base -> origin/gh/eellison/884/base 2025-12-04T08:56:59.1307088Z * [new branch] gh/eellison/884/head -> origin/gh/eellison/884/head 2025-12-04T08:56:59.1308529Z * [new branch] gh/eellison/884/orig -> origin/gh/eellison/884/orig 2025-12-04T08:56:59.1311008Z * [new branch] gh/etaf/147/base -> origin/gh/etaf/147/base 2025-12-04T08:56:59.1312773Z * [new branch] gh/etaf/147/head -> origin/gh/etaf/147/head 2025-12-04T08:56:59.1314929Z * [new branch] gh/etaf/154/base -> origin/gh/etaf/154/base 2025-12-04T08:56:59.1316502Z * [new branch] gh/etaf/154/head -> origin/gh/etaf/154/head 2025-12-04T08:56:59.1318065Z * [new branch] gh/etaf/154/orig -> origin/gh/etaf/154/orig 2025-12-04T08:56:59.1320233Z * [new branch] gh/etaf/156/base -> origin/gh/etaf/156/base 2025-12-04T08:56:59.1321762Z * [new branch] gh/etaf/156/head -> origin/gh/etaf/156/head 2025-12-04T08:56:59.1323302Z * [new branch] gh/etaf/156/orig -> origin/gh/etaf/156/orig 2025-12-04T08:56:59.1325459Z * [new branch] gh/etaf/157/base -> origin/gh/etaf/157/base 2025-12-04T08:56:59.1327045Z * [new branch] gh/etaf/157/head -> origin/gh/etaf/157/head 2025-12-04T08:56:59.1328597Z * [new branch] gh/etaf/157/orig -> origin/gh/etaf/157/orig 2025-12-04T08:56:59.1330598Z * [new branch] gh/etaf/158/base -> origin/gh/etaf/158/base 2025-12-04T08:56:59.1332326Z * [new branch] gh/etaf/158/head -> origin/gh/etaf/158/head 2025-12-04T08:56:59.1334002Z * [new branch] gh/etaf/158/orig -> origin/gh/etaf/158/orig 2025-12-04T08:56:59.1336148Z * [new branch] gh/etaf/159/base -> origin/gh/etaf/159/base 2025-12-04T08:56:59.1337748Z * [new branch] gh/etaf/159/head -> origin/gh/etaf/159/head 2025-12-04T08:56:59.1339321Z * [new branch] gh/etaf/159/orig -> origin/gh/etaf/159/orig 2025-12-04T08:56:59.1341565Z * [new branch] gh/etaf/160/base -> origin/gh/etaf/160/base 2025-12-04T08:56:59.1343207Z * [new branch] gh/etaf/160/head -> origin/gh/etaf/160/head 2025-12-04T08:56:59.1344795Z * [new branch] gh/etaf/160/orig -> origin/gh/etaf/160/orig 2025-12-04T08:56:59.1346908Z * [new branch] gh/etaf/161/base -> origin/gh/etaf/161/base 2025-12-04T08:56:59.1348575Z * [new branch] gh/etaf/161/head -> origin/gh/etaf/161/head 2025-12-04T08:56:59.1350209Z * [new branch] gh/etaf/161/orig -> origin/gh/etaf/161/orig 2025-12-04T08:56:59.1352172Z * [new branch] gh/etaf/166/base -> origin/gh/etaf/166/base 2025-12-04T08:56:59.1353901Z * [new branch] gh/etaf/166/head -> origin/gh/etaf/166/head 2025-12-04T08:56:59.1355511Z * [new branch] gh/etaf/166/orig -> origin/gh/etaf/166/orig 2025-12-04T08:56:59.1357458Z * [new branch] gh/etaf/167/base -> origin/gh/etaf/167/base 2025-12-04T08:56:59.1359043Z * [new branch] gh/etaf/167/head -> origin/gh/etaf/167/head 2025-12-04T08:56:59.1360602Z * [new branch] gh/etaf/167/orig -> origin/gh/etaf/167/orig 2025-12-04T08:56:59.1363271Z * [new branch] gh/etaf/168/base -> origin/gh/etaf/168/base 2025-12-04T08:56:59.1364914Z * [new branch] gh/etaf/168/head -> origin/gh/etaf/168/head 2025-12-04T08:56:59.1366470Z * [new branch] gh/etaf/168/orig -> origin/gh/etaf/168/orig 2025-12-04T08:56:59.1368655Z * [new branch] gh/etaf/172/base -> origin/gh/etaf/172/base 2025-12-04T08:56:59.1370292Z * [new branch] gh/etaf/172/head -> origin/gh/etaf/172/head 2025-12-04T08:56:59.1372050Z * [new branch] gh/etaf/172/orig -> origin/gh/etaf/172/orig 2025-12-04T08:56:59.1374271Z * [new branch] gh/etaf/173/base -> origin/gh/etaf/173/base 2025-12-04T08:56:59.1375956Z * [new branch] gh/etaf/173/head -> origin/gh/etaf/173/head 2025-12-04T08:56:59.1377492Z * [new branch] gh/etaf/173/orig -> origin/gh/etaf/173/orig 2025-12-04T08:56:59.1379537Z * [new branch] gh/etaf/174/base -> origin/gh/etaf/174/base 2025-12-04T08:56:59.1381146Z * [new branch] gh/etaf/174/head -> origin/gh/etaf/174/head 2025-12-04T08:56:59.1383215Z * [new branch] gh/etaf/175/base -> origin/gh/etaf/175/base 2025-12-04T08:56:59.1384830Z * [new branch] gh/etaf/175/head -> origin/gh/etaf/175/head 2025-12-04T08:56:59.1386364Z * [new branch] gh/etaf/175/orig -> origin/gh/etaf/175/orig 2025-12-04T08:56:59.1388534Z * [new branch] gh/etaf/176/base -> origin/gh/etaf/176/base 2025-12-04T08:56:59.1390140Z * [new branch] gh/etaf/176/head -> origin/gh/etaf/176/head 2025-12-04T08:56:59.1391721Z * [new branch] gh/etaf/176/orig -> origin/gh/etaf/176/orig 2025-12-04T08:56:59.1394344Z * [new branch] gh/etaf/177/base -> origin/gh/etaf/177/base 2025-12-04T08:56:59.1395992Z * [new branch] gh/etaf/177/head -> origin/gh/etaf/177/head 2025-12-04T08:56:59.1397535Z * [new branch] gh/etaf/177/orig -> origin/gh/etaf/177/orig 2025-12-04T08:56:59.1399744Z * [new branch] gh/etaf/178/base -> origin/gh/etaf/178/base 2025-12-04T08:56:59.1401437Z * [new branch] gh/etaf/178/head -> origin/gh/etaf/178/head 2025-12-04T08:56:59.1403055Z * [new branch] gh/etaf/178/orig -> origin/gh/etaf/178/orig 2025-12-04T08:56:59.1405297Z * [new branch] gh/etaf/179/base -> origin/gh/etaf/179/base 2025-12-04T08:56:59.1406978Z * [new branch] gh/etaf/179/head -> origin/gh/etaf/179/head 2025-12-04T08:56:59.1408512Z * [new branch] gh/etaf/179/orig -> origin/gh/etaf/179/orig 2025-12-04T08:56:59.1410500Z * [new branch] gh/etaf/180/base -> origin/gh/etaf/180/base 2025-12-04T08:56:59.1412153Z * [new branch] gh/etaf/180/head -> origin/gh/etaf/180/head 2025-12-04T08:56:59.1414223Z * [new branch] gh/etaf/180/orig -> origin/gh/etaf/180/orig 2025-12-04T08:56:59.1416771Z * [new branch] gh/exclamaforte/1/base -> origin/gh/exclamaforte/1/base 2025-12-04T08:56:59.1418369Z * [new branch] gh/exclamaforte/1/head -> origin/gh/exclamaforte/1/head 2025-12-04T08:56:59.1420403Z * [new branch] gh/exclamaforte/2/base -> origin/gh/exclamaforte/2/base 2025-12-04T08:56:59.1422010Z * [new branch] gh/exclamaforte/2/head -> origin/gh/exclamaforte/2/head 2025-12-04T08:56:59.1424671Z * [new branch] gh/exclamaforte/3/base -> origin/gh/exclamaforte/3/base 2025-12-04T08:56:59.1427182Z * [new branch] gh/exclamaforte/3/head -> origin/gh/exclamaforte/3/head 2025-12-04T08:56:59.1429291Z * [new branch] gh/exclamaforte/4/base -> origin/gh/exclamaforte/4/base 2025-12-04T08:56:59.1431355Z * [new branch] gh/exclamaforte/4/head -> origin/gh/exclamaforte/4/head 2025-12-04T08:56:59.1434050Z * [new branch] gh/ezyang/2374/base -> origin/gh/ezyang/2374/base 2025-12-04T08:56:59.1435616Z * [new branch] gh/ezyang/2374/head -> origin/gh/ezyang/2374/head 2025-12-04T08:56:59.1437157Z * [new branch] gh/ezyang/2374/orig -> origin/gh/ezyang/2374/orig 2025-12-04T08:56:59.1439336Z * [new branch] gh/ezyang/2973/base -> origin/gh/ezyang/2973/base 2025-12-04T08:56:59.1440854Z * [new branch] gh/ezyang/2973/head -> origin/gh/ezyang/2973/head 2025-12-04T08:56:59.1442429Z * [new branch] gh/ezyang/2973/orig -> origin/gh/ezyang/2973/orig 2025-12-04T08:56:59.1444491Z * [new branch] gh/ezyang/2974/base -> origin/gh/ezyang/2974/base 2025-12-04T08:56:59.1446108Z * [new branch] gh/ezyang/2974/head -> origin/gh/ezyang/2974/head 2025-12-04T08:56:59.1447672Z * [new branch] gh/ezyang/2974/orig -> origin/gh/ezyang/2974/orig 2025-12-04T08:56:59.1449816Z * [new branch] gh/ezyang/3131/base -> origin/gh/ezyang/3131/base 2025-12-04T08:56:59.1451394Z * [new branch] gh/ezyang/3131/head -> origin/gh/ezyang/3131/head 2025-12-04T08:56:59.1453150Z * [new branch] gh/ezyang/3131/orig -> origin/gh/ezyang/3131/orig 2025-12-04T08:56:59.1455140Z * [new branch] gh/ezyang/3139/base -> origin/gh/ezyang/3139/base 2025-12-04T08:56:59.1456704Z * [new branch] gh/ezyang/3139/head -> origin/gh/ezyang/3139/head 2025-12-04T08:56:59.1458292Z * [new branch] gh/ezyang/3139/orig -> origin/gh/ezyang/3139/orig 2025-12-04T08:56:59.1460226Z * [new branch] gh/ezyang/3140/base -> origin/gh/ezyang/3140/base 2025-12-04T08:56:59.1461956Z * [new branch] gh/ezyang/3140/head -> origin/gh/ezyang/3140/head 2025-12-04T08:56:59.1463648Z * [new branch] gh/ezyang/3140/orig -> origin/gh/ezyang/3140/orig 2025-12-04T08:56:59.1465733Z * [new branch] gh/ezyang/3143/base -> origin/gh/ezyang/3143/base 2025-12-04T08:56:59.1467271Z * [new branch] gh/ezyang/3143/head -> origin/gh/ezyang/3143/head 2025-12-04T08:56:59.1468812Z * [new branch] gh/ezyang/3143/orig -> origin/gh/ezyang/3143/orig 2025-12-04T08:56:59.1470984Z * [new branch] gh/ezyang/3144/base -> origin/gh/ezyang/3144/base 2025-12-04T08:56:59.1472551Z * [new branch] gh/ezyang/3144/head -> origin/gh/ezyang/3144/head 2025-12-04T08:56:59.1474142Z * [new branch] gh/ezyang/3144/orig -> origin/gh/ezyang/3144/orig 2025-12-04T08:56:59.1476221Z * [new branch] gh/ezyang/3167/base -> origin/gh/ezyang/3167/base 2025-12-04T08:56:59.1477838Z * [new branch] gh/ezyang/3167/head -> origin/gh/ezyang/3167/head 2025-12-04T08:56:59.1479368Z * [new branch] gh/ezyang/3167/orig -> origin/gh/ezyang/3167/orig 2025-12-04T08:56:59.1481362Z * [new branch] gh/ezyang/3173/base -> origin/gh/ezyang/3173/base 2025-12-04T08:56:59.1483094Z * [new branch] gh/ezyang/3173/head -> origin/gh/ezyang/3173/head 2025-12-04T08:56:59.1484634Z * [new branch] gh/ezyang/3173/orig -> origin/gh/ezyang/3173/orig 2025-12-04T08:56:59.1486694Z * [new branch] gh/ezyang/3175/base -> origin/gh/ezyang/3175/base 2025-12-04T08:56:59.1488254Z * [new branch] gh/ezyang/3175/head -> origin/gh/ezyang/3175/head 2025-12-04T08:56:59.1489842Z * [new branch] gh/ezyang/3175/orig -> origin/gh/ezyang/3175/orig 2025-12-04T08:56:59.1491932Z * [new branch] gh/ezyang/3182/base -> origin/gh/ezyang/3182/base 2025-12-04T08:56:59.1493599Z * [new branch] gh/ezyang/3182/head -> origin/gh/ezyang/3182/head 2025-12-04T08:56:59.1495281Z * [new branch] gh/ezyang/3182/orig -> origin/gh/ezyang/3182/orig 2025-12-04T08:56:59.1497245Z * [new branch] gh/ezyang/3185/base -> origin/gh/ezyang/3185/base 2025-12-04T08:56:59.1498787Z * [new branch] gh/ezyang/3185/head -> origin/gh/ezyang/3185/head 2025-12-04T08:56:59.1500359Z * [new branch] gh/ezyang/3185/orig -> origin/gh/ezyang/3185/orig 2025-12-04T08:56:59.1502553Z * [new branch] gh/ezyang/3189/base -> origin/gh/ezyang/3189/base 2025-12-04T08:56:59.1504161Z * [new branch] gh/ezyang/3189/head -> origin/gh/ezyang/3189/head 2025-12-04T08:56:59.1505660Z * [new branch] gh/ezyang/3189/orig -> origin/gh/ezyang/3189/orig 2025-12-04T08:56:59.1507664Z * [new branch] gh/ezyang/3191/base -> origin/gh/ezyang/3191/base 2025-12-04T08:56:59.1509236Z * [new branch] gh/ezyang/3191/head -> origin/gh/ezyang/3191/head 2025-12-04T08:56:59.1510815Z * [new branch] gh/ezyang/3191/orig -> origin/gh/ezyang/3191/orig 2025-12-04T08:56:59.1513457Z * [new branch] gh/ezyang/3192/base -> origin/gh/ezyang/3192/base 2025-12-04T08:56:59.1515006Z * [new branch] gh/ezyang/3192/head -> origin/gh/ezyang/3192/head 2025-12-04T08:56:59.1516594Z * [new branch] gh/ezyang/3192/orig -> origin/gh/ezyang/3192/orig 2025-12-04T08:56:59.1518712Z * [new branch] gh/ezyang/3193/base -> origin/gh/ezyang/3193/base 2025-12-04T08:56:59.1520603Z * [new branch] gh/ezyang/3193/head -> origin/gh/ezyang/3193/head 2025-12-04T08:56:59.1522194Z * [new branch] gh/ezyang/3193/orig -> origin/gh/ezyang/3193/orig 2025-12-04T08:56:59.1524350Z * [new branch] gh/ezyang/3194/base -> origin/gh/ezyang/3194/base 2025-12-04T08:56:59.1525903Z * [new branch] gh/ezyang/3194/head -> origin/gh/ezyang/3194/head 2025-12-04T08:56:59.1527425Z * [new branch] gh/ezyang/3194/orig -> origin/gh/ezyang/3194/orig 2025-12-04T08:56:59.1529501Z * [new branch] gh/ezyang/3195/base -> origin/gh/ezyang/3195/base 2025-12-04T08:56:59.1531095Z * [new branch] gh/ezyang/3195/head -> origin/gh/ezyang/3195/head 2025-12-04T08:56:59.1532834Z * [new branch] gh/ezyang/3195/orig -> origin/gh/ezyang/3195/orig 2025-12-04T08:56:59.1534993Z * [new branch] gh/ezyang/3196/base -> origin/gh/ezyang/3196/base 2025-12-04T08:56:59.1536588Z * [new branch] gh/ezyang/3196/head -> origin/gh/ezyang/3196/head 2025-12-04T08:56:59.1538289Z * [new branch] gh/ezyang/3196/orig -> origin/gh/ezyang/3196/orig 2025-12-04T08:56:59.1540461Z * [new branch] gh/ezyang/3197/base -> origin/gh/ezyang/3197/base 2025-12-04T08:56:59.1542026Z * [new branch] gh/ezyang/3197/head -> origin/gh/ezyang/3197/head 2025-12-04T08:56:59.1543562Z * [new branch] gh/ezyang/3197/orig -> origin/gh/ezyang/3197/orig 2025-12-04T08:56:59.1545710Z * [new branch] gh/ezyang/3198/base -> origin/gh/ezyang/3198/base 2025-12-04T08:56:59.1547235Z * [new branch] gh/ezyang/3198/head -> origin/gh/ezyang/3198/head 2025-12-04T08:56:59.1548815Z * [new branch] gh/ezyang/3198/orig -> origin/gh/ezyang/3198/orig 2025-12-04T08:56:59.1551164Z * [new branch] gh/ezyang/3199/base -> origin/gh/ezyang/3199/base 2025-12-04T08:56:59.1552702Z * [new branch] gh/ezyang/3199/head -> origin/gh/ezyang/3199/head 2025-12-04T08:56:59.1554269Z * [new branch] gh/ezyang/3199/orig -> origin/gh/ezyang/3199/orig 2025-12-04T08:56:59.1556460Z * [new branch] gh/ezyang/3200/base -> origin/gh/ezyang/3200/base 2025-12-04T08:56:59.1558090Z * [new branch] gh/ezyang/3200/head -> origin/gh/ezyang/3200/head 2025-12-04T08:56:59.1559558Z * [new branch] gh/ezyang/3200/orig -> origin/gh/ezyang/3200/orig 2025-12-04T08:56:59.1561650Z * [new branch] gh/ezyang/3201/base -> origin/gh/ezyang/3201/base 2025-12-04T08:56:59.1564369Z * [new branch] gh/ezyang/3201/head -> origin/gh/ezyang/3201/head 2025-12-04T08:56:59.1566101Z * [new branch] gh/ezyang/3201/orig -> origin/gh/ezyang/3201/orig 2025-12-04T08:56:59.1568193Z * [new branch] gh/ezyang/3202/base -> origin/gh/ezyang/3202/base 2025-12-04T08:56:59.1569682Z * [new branch] gh/ezyang/3202/head -> origin/gh/ezyang/3202/head 2025-12-04T08:56:59.1571181Z * [new branch] gh/ezyang/3202/orig -> origin/gh/ezyang/3202/orig 2025-12-04T08:56:59.1573573Z * [new branch] gh/ezyang/3203/base -> origin/gh/ezyang/3203/base 2025-12-04T08:56:59.1574958Z * [new branch] gh/ezyang/3203/head -> origin/gh/ezyang/3203/head 2025-12-04T08:56:59.1576619Z * [new branch] gh/ezyang/3203/orig -> origin/gh/ezyang/3203/orig 2025-12-04T08:56:59.1578810Z * [new branch] gh/ezyang/3204/base -> origin/gh/ezyang/3204/base 2025-12-04T08:56:59.1580366Z * [new branch] gh/ezyang/3204/head -> origin/gh/ezyang/3204/head 2025-12-04T08:56:59.1581970Z * [new branch] gh/ezyang/3204/orig -> origin/gh/ezyang/3204/orig 2025-12-04T08:56:59.1584052Z * [new branch] gh/ezyang/3205/base -> origin/gh/ezyang/3205/base 2025-12-04T08:56:59.1585628Z * [new branch] gh/ezyang/3205/head -> origin/gh/ezyang/3205/head 2025-12-04T08:56:59.1587149Z * [new branch] gh/ezyang/3205/orig -> origin/gh/ezyang/3205/orig 2025-12-04T08:56:59.1589292Z * [new branch] gh/ezyang/3206/base -> origin/gh/ezyang/3206/base 2025-12-04T08:56:59.1590823Z * [new branch] gh/ezyang/3206/head -> origin/gh/ezyang/3206/head 2025-12-04T08:56:59.1592387Z * [new branch] gh/ezyang/3206/orig -> origin/gh/ezyang/3206/orig 2025-12-04T08:56:59.1594616Z * [new branch] gh/ezyang/3207/base -> origin/gh/ezyang/3207/base 2025-12-04T08:56:59.1596182Z * [new branch] gh/ezyang/3207/head -> origin/gh/ezyang/3207/head 2025-12-04T08:56:59.1597728Z * [new branch] gh/ezyang/3207/orig -> origin/gh/ezyang/3207/orig 2025-12-04T08:56:59.1599888Z * [new branch] gh/ezyang/3208/base -> origin/gh/ezyang/3208/base 2025-12-04T08:56:59.1601494Z * [new branch] gh/ezyang/3208/head -> origin/gh/ezyang/3208/head 2025-12-04T08:56:59.1603072Z * [new branch] gh/ezyang/3208/orig -> origin/gh/ezyang/3208/orig 2025-12-04T08:56:59.1605154Z * [new branch] gh/ezyang/3209/base -> origin/gh/ezyang/3209/base 2025-12-04T08:56:59.1606753Z * [new branch] gh/ezyang/3209/head -> origin/gh/ezyang/3209/head 2025-12-04T08:56:59.1608330Z * [new branch] gh/ezyang/3209/orig -> origin/gh/ezyang/3209/orig 2025-12-04T08:56:59.1610782Z * [new branch] gh/fadara01/3/base -> origin/gh/fadara01/3/base 2025-12-04T08:56:59.1612423Z * [new branch] gh/fadara01/3/head -> origin/gh/fadara01/3/head 2025-12-04T08:56:59.1614055Z * [new branch] gh/fadara01/3/orig -> origin/gh/fadara01/3/orig 2025-12-04T08:56:59.1616246Z * [new branch] gh/fadara01/5/base -> origin/gh/fadara01/5/base 2025-12-04T08:56:59.1617794Z * [new branch] gh/fadara01/5/head -> origin/gh/fadara01/5/head 2025-12-04T08:56:59.1619463Z * [new branch] gh/fadara01/5/orig -> origin/gh/fadara01/5/orig 2025-12-04T08:56:59.1621455Z * [new branch] gh/fadara01/6/base -> origin/gh/fadara01/6/base 2025-12-04T08:56:59.1622971Z * [new branch] gh/fadara01/6/head -> origin/gh/fadara01/6/head 2025-12-04T08:56:59.1624535Z * [new branch] gh/fadara01/6/orig -> origin/gh/fadara01/6/orig 2025-12-04T08:56:59.1626567Z * [new branch] gh/fadara01/7/base -> origin/gh/fadara01/7/base 2025-12-04T08:56:59.1628283Z * [new branch] gh/fadara01/7/head -> origin/gh/fadara01/7/head 2025-12-04T08:56:59.1629823Z * [new branch] gh/fadara01/7/orig -> origin/gh/fadara01/7/orig 2025-12-04T08:56:59.1631884Z * [new branch] gh/fadara01/8/base -> origin/gh/fadara01/8/base 2025-12-04T08:56:59.1633427Z * [new branch] gh/fadara01/8/head -> origin/gh/fadara01/8/head 2025-12-04T08:56:59.1635002Z * [new branch] gh/fadara01/8/orig -> origin/gh/fadara01/8/orig 2025-12-04T08:56:59.1637134Z * [new branch] gh/fadara01/9/base -> origin/gh/fadara01/9/base 2025-12-04T08:56:59.1638660Z * [new branch] gh/fadara01/9/head -> origin/gh/fadara01/9/head 2025-12-04T08:56:59.1640302Z * [new branch] gh/fadara01/9/orig -> origin/gh/fadara01/9/orig 2025-12-04T08:56:59.1642840Z * [new branch] gh/fduwjj/182/base -> origin/gh/fduwjj/182/base 2025-12-04T08:56:59.1644443Z * [new branch] gh/fduwjj/182/head -> origin/gh/fduwjj/182/head 2025-12-04T08:56:59.1645931Z * [new branch] gh/fduwjj/182/orig -> origin/gh/fduwjj/182/orig 2025-12-04T08:56:59.1647978Z * [new branch] gh/fduwjj/211/base -> origin/gh/fduwjj/211/base 2025-12-04T08:56:59.1649698Z * [new branch] gh/fduwjj/211/head -> origin/gh/fduwjj/211/head 2025-12-04T08:56:59.1651826Z * [new branch] gh/fduwjj/211/orig -> origin/gh/fduwjj/211/orig 2025-12-04T08:56:59.1654063Z * [new branch] gh/fduwjj/212/base -> origin/gh/fduwjj/212/base 2025-12-04T08:56:59.1655639Z * [new branch] gh/fduwjj/212/head -> origin/gh/fduwjj/212/head 2025-12-04T08:56:59.1657244Z * [new branch] gh/fduwjj/212/orig -> origin/gh/fduwjj/212/orig 2025-12-04T08:56:59.1659240Z * [new branch] gh/fduwjj/213/base -> origin/gh/fduwjj/213/base 2025-12-04T08:56:59.1660875Z * [new branch] gh/fduwjj/213/head -> origin/gh/fduwjj/213/head 2025-12-04T08:56:59.1662650Z * [new branch] gh/fduwjj/213/orig -> origin/gh/fduwjj/213/orig 2025-12-04T08:56:59.1664803Z * [new branch] gh/fduwjj/226/base -> origin/gh/fduwjj/226/base 2025-12-04T08:56:59.1666335Z * [new branch] gh/fduwjj/226/head -> origin/gh/fduwjj/226/head 2025-12-04T08:56:59.1667849Z * [new branch] gh/fduwjj/226/orig -> origin/gh/fduwjj/226/orig 2025-12-04T08:56:59.1670091Z * [new branch] gh/fduwjj/229/base -> origin/gh/fduwjj/229/base 2025-12-04T08:56:59.1671581Z * [new branch] gh/fduwjj/229/head -> origin/gh/fduwjj/229/head 2025-12-04T08:56:59.1673084Z * [new branch] gh/fduwjj/229/orig -> origin/gh/fduwjj/229/orig 2025-12-04T08:56:59.1675137Z * [new branch] gh/fduwjj/233/base -> origin/gh/fduwjj/233/base 2025-12-04T08:56:59.1676766Z * [new branch] gh/fduwjj/233/head -> origin/gh/fduwjj/233/head 2025-12-04T08:56:59.1678328Z * [new branch] gh/fduwjj/233/orig -> origin/gh/fduwjj/233/orig 2025-12-04T08:56:59.1680460Z * [new branch] gh/fduwjj/234/base -> origin/gh/fduwjj/234/base 2025-12-04T08:56:59.1682278Z * [new branch] gh/fduwjj/234/head -> origin/gh/fduwjj/234/head 2025-12-04T08:56:59.1683862Z * [new branch] gh/fduwjj/234/orig -> origin/gh/fduwjj/234/orig 2025-12-04T08:56:59.1685925Z * [new branch] gh/fduwjj/235/base -> origin/gh/fduwjj/235/base 2025-12-04T08:56:59.1687554Z * [new branch] gh/fduwjj/235/head -> origin/gh/fduwjj/235/head 2025-12-04T08:56:59.1689111Z * [new branch] gh/fduwjj/235/orig -> origin/gh/fduwjj/235/orig 2025-12-04T08:56:59.1691050Z * [new branch] gh/fduwjj/236/base -> origin/gh/fduwjj/236/base 2025-12-04T08:56:59.1692968Z * [new branch] gh/fduwjj/236/head -> origin/gh/fduwjj/236/head 2025-12-04T08:56:59.1694460Z * [new branch] gh/fduwjj/236/orig -> origin/gh/fduwjj/236/orig 2025-12-04T08:56:59.1696297Z * [new branch] gh/fduwjj/237/base -> origin/gh/fduwjj/237/base 2025-12-04T08:56:59.1697854Z * [new branch] gh/fduwjj/237/head -> origin/gh/fduwjj/237/head 2025-12-04T08:56:59.1699455Z * [new branch] gh/fduwjj/237/orig -> origin/gh/fduwjj/237/orig 2025-12-04T08:56:59.1701949Z * [new branch] gh/fduwjj/238/base -> origin/gh/fduwjj/238/base 2025-12-04T08:56:59.1703654Z * [new branch] gh/fduwjj/238/head -> origin/gh/fduwjj/238/head 2025-12-04T08:56:59.1705328Z * [new branch] gh/fduwjj/238/orig -> origin/gh/fduwjj/238/orig 2025-12-04T08:56:59.1707386Z * [new branch] gh/fduwjj/239/base -> origin/gh/fduwjj/239/base 2025-12-04T08:56:59.1709013Z * [new branch] gh/fduwjj/239/head -> origin/gh/fduwjj/239/head 2025-12-04T08:56:59.1710561Z * [new branch] gh/fduwjj/239/orig -> origin/gh/fduwjj/239/orig 2025-12-04T08:56:59.1713215Z * [new branch] gh/fegin/332/base -> origin/gh/fegin/332/base 2025-12-04T08:56:59.1714794Z * [new branch] gh/fegin/332/head -> origin/gh/fegin/332/head 2025-12-04T08:56:59.1716368Z * [new branch] gh/fegin/332/orig -> origin/gh/fegin/332/orig 2025-12-04T08:56:59.1719068Z * [new branch] gh/fegin/333/base -> origin/gh/fegin/333/base 2025-12-04T08:56:59.1720705Z * [new branch] gh/fegin/333/head -> origin/gh/fegin/333/head 2025-12-04T08:56:59.1722290Z * [new branch] gh/fegin/333/orig -> origin/gh/fegin/333/orig 2025-12-04T08:56:59.1724437Z * [new branch] gh/fegin/334/base -> origin/gh/fegin/334/base 2025-12-04T08:56:59.1726092Z * [new branch] gh/fegin/334/head -> origin/gh/fegin/334/head 2025-12-04T08:56:59.1727728Z * [new branch] gh/fegin/334/orig -> origin/gh/fegin/334/orig 2025-12-04T08:56:59.1729778Z * [new branch] gh/fegin/335/base -> origin/gh/fegin/335/base 2025-12-04T08:56:59.1731380Z * [new branch] gh/fegin/335/head -> origin/gh/fegin/335/head 2025-12-04T08:56:59.1733002Z * [new branch] gh/fegin/335/orig -> origin/gh/fegin/335/orig 2025-12-04T08:56:59.1735488Z * [new branch] gh/fffrog/160/base -> origin/gh/fffrog/160/base 2025-12-04T08:56:59.1737217Z * [new branch] gh/fffrog/160/head -> origin/gh/fffrog/160/head 2025-12-04T08:56:59.1739360Z * [new branch] gh/fffrog/177/base -> origin/gh/fffrog/177/base 2025-12-04T08:56:59.1740889Z * [new branch] gh/fffrog/177/head -> origin/gh/fffrog/177/head 2025-12-04T08:56:59.1742423Z * [new branch] gh/fffrog/177/orig -> origin/gh/fffrog/177/orig 2025-12-04T08:56:59.1744636Z * [new branch] gh/fffrog/178/base -> origin/gh/fffrog/178/base 2025-12-04T08:56:59.1746192Z * [new branch] gh/fffrog/178/head -> origin/gh/fffrog/178/head 2025-12-04T08:56:59.1747754Z * [new branch] gh/fffrog/178/orig -> origin/gh/fffrog/178/orig 2025-12-04T08:56:59.1749764Z * [new branch] gh/fffrog/181/base -> origin/gh/fffrog/181/base 2025-12-04T08:56:59.1751365Z * [new branch] gh/fffrog/181/head -> origin/gh/fffrog/181/head 2025-12-04T08:56:59.1753375Z * [new branch] gh/fffrog/181/orig -> origin/gh/fffrog/181/orig 2025-12-04T08:56:59.1755435Z * [new branch] gh/fffrog/183/base -> origin/gh/fffrog/183/base 2025-12-04T08:56:59.1757027Z * [new branch] gh/fffrog/183/head -> origin/gh/fffrog/183/head 2025-12-04T08:56:59.1758652Z * [new branch] gh/fffrog/183/orig -> origin/gh/fffrog/183/orig 2025-12-04T08:56:59.1761131Z * [new branch] gh/fxdawnn/10/base -> origin/gh/fxdawnn/10/base 2025-12-04T08:56:59.1762772Z * [new branch] gh/fxdawnn/10/head -> origin/gh/fxdawnn/10/head 2025-12-04T08:56:59.1764390Z * [new branch] gh/fxdawnn/10/orig -> origin/gh/fxdawnn/10/orig 2025-12-04T08:56:59.1766648Z * [new branch] gh/fxdawnn/11/base -> origin/gh/fxdawnn/11/base 2025-12-04T08:56:59.1768180Z * [new branch] gh/fxdawnn/11/head -> origin/gh/fxdawnn/11/head 2025-12-04T08:56:59.1769852Z * [new branch] gh/fxdawnn/11/orig -> origin/gh/fxdawnn/11/orig 2025-12-04T08:56:59.1772047Z * [new branch] gh/fxdawnn/12/base -> origin/gh/fxdawnn/12/base 2025-12-04T08:56:59.1773892Z * [new branch] gh/fxdawnn/12/head -> origin/gh/fxdawnn/12/head 2025-12-04T08:56:59.1775369Z * [new branch] gh/fxdawnn/12/orig -> origin/gh/fxdawnn/12/orig 2025-12-04T08:56:59.1777365Z * [new branch] gh/fxdawnn/13/base -> origin/gh/fxdawnn/13/base 2025-12-04T08:56:59.1779127Z * [new branch] gh/fxdawnn/13/head -> origin/gh/fxdawnn/13/head 2025-12-04T08:56:59.1780562Z * [new branch] gh/fxdawnn/13/orig -> origin/gh/fxdawnn/13/orig 2025-12-04T08:56:59.1782770Z * [new branch] gh/fxdawnn/14/base -> origin/gh/fxdawnn/14/base 2025-12-04T08:56:59.1784255Z * [new branch] gh/fxdawnn/14/head -> origin/gh/fxdawnn/14/head 2025-12-04T08:56:59.1785787Z * [new branch] gh/fxdawnn/14/orig -> origin/gh/fxdawnn/14/orig 2025-12-04T08:56:59.1787834Z * [new branch] gh/fxdawnn/15/base -> origin/gh/fxdawnn/15/base 2025-12-04T08:56:59.1789441Z * [new branch] gh/fxdawnn/15/head -> origin/gh/fxdawnn/15/head 2025-12-04T08:56:59.1790992Z * [new branch] gh/fxdawnn/15/orig -> origin/gh/fxdawnn/15/orig 2025-12-04T08:56:59.1793007Z * [new branch] gh/fxdawnn/6/base -> origin/gh/fxdawnn/6/base 2025-12-04T08:56:59.1794584Z * [new branch] gh/fxdawnn/6/head -> origin/gh/fxdawnn/6/head 2025-12-04T08:56:59.1796166Z * [new branch] gh/fxdawnn/6/orig -> origin/gh/fxdawnn/6/orig 2025-12-04T08:56:59.1798197Z * [new branch] gh/fxdawnn/7/base -> origin/gh/fxdawnn/7/base 2025-12-04T08:56:59.1799883Z * [new branch] gh/fxdawnn/7/head -> origin/gh/fxdawnn/7/head 2025-12-04T08:56:59.1801372Z * [new branch] gh/fxdawnn/7/orig -> origin/gh/fxdawnn/7/orig 2025-12-04T08:56:59.1803371Z * [new branch] gh/fxdawnn/9/base -> origin/gh/fxdawnn/9/base 2025-12-04T08:56:59.1804834Z * [new branch] gh/fxdawnn/9/head -> origin/gh/fxdawnn/9/head 2025-12-04T08:56:59.1806348Z * [new branch] gh/fxdawnn/9/orig -> origin/gh/fxdawnn/9/orig 2025-12-04T08:56:59.1808991Z * [new branch] gh/galv/1/base -> origin/gh/galv/1/base 2025-12-04T08:56:59.1810517Z * [new branch] gh/galv/1/head -> origin/gh/galv/1/head 2025-12-04T08:56:59.1812148Z * [new branch] gh/galv/1/orig -> origin/gh/galv/1/orig 2025-12-04T08:56:59.1814395Z * [new branch] gh/galv/2/base -> origin/gh/galv/2/base 2025-12-04T08:56:59.1815869Z * [new branch] gh/galv/2/head -> origin/gh/galv/2/head 2025-12-04T08:56:59.1817586Z * [new branch] gh/galv/2/orig -> origin/gh/galv/2/orig 2025-12-04T08:56:59.1819695Z * [new branch] gh/galv/3/base -> origin/gh/galv/3/base 2025-12-04T08:56:59.1821529Z * [new branch] gh/galv/3/head -> origin/gh/galv/3/head 2025-12-04T08:56:59.1823031Z * [new branch] gh/galv/3/orig -> origin/gh/galv/3/orig 2025-12-04T08:56:59.1825335Z * [new branch] gh/guangyey/134/base -> origin/gh/guangyey/134/base 2025-12-04T08:56:59.1826938Z * [new branch] gh/guangyey/134/head -> origin/gh/guangyey/134/head 2025-12-04T08:56:59.1828518Z * [new branch] gh/guangyey/134/orig -> origin/gh/guangyey/134/orig 2025-12-04T08:56:59.1830544Z * [new branch] gh/guangyey/163/base -> origin/gh/guangyey/163/base 2025-12-04T08:56:59.1832086Z * [new branch] gh/guangyey/163/head -> origin/gh/guangyey/163/head 2025-12-04T08:56:59.1833660Z * [new branch] gh/guangyey/163/orig -> origin/gh/guangyey/163/orig 2025-12-04T08:56:59.1835765Z * [new branch] gh/guangyey/168/base -> origin/gh/guangyey/168/base 2025-12-04T08:56:59.1837775Z * [new branch] gh/guangyey/168/head -> origin/gh/guangyey/168/head 2025-12-04T08:56:59.1839380Z * [new branch] gh/guangyey/168/orig -> origin/gh/guangyey/168/orig 2025-12-04T08:56:59.1841449Z * [new branch] gh/guangyey/169/base -> origin/gh/guangyey/169/base 2025-12-04T08:56:59.1843061Z * [new branch] gh/guangyey/169/head -> origin/gh/guangyey/169/head 2025-12-04T08:56:59.1844609Z * [new branch] gh/guangyey/169/orig -> origin/gh/guangyey/169/orig 2025-12-04T08:56:59.1846689Z * [new branch] gh/guangyey/170/base -> origin/gh/guangyey/170/base 2025-12-04T08:56:59.1848316Z * [new branch] gh/guangyey/170/head -> origin/gh/guangyey/170/head 2025-12-04T08:56:59.1849819Z * [new branch] gh/guangyey/170/orig -> origin/gh/guangyey/170/orig 2025-12-04T08:56:59.1851864Z * [new branch] gh/guangyey/171/base -> origin/gh/guangyey/171/base 2025-12-04T08:56:59.1853575Z * [new branch] gh/guangyey/171/head -> origin/gh/guangyey/171/head 2025-12-04T08:56:59.1855164Z * [new branch] gh/guangyey/171/orig -> origin/gh/guangyey/171/orig 2025-12-04T08:56:59.1857192Z * [new branch] gh/guangyey/178/base -> origin/gh/guangyey/178/base 2025-12-04T08:56:59.1858824Z * [new branch] gh/guangyey/178/head -> origin/gh/guangyey/178/head 2025-12-04T08:56:59.1860354Z * [new branch] gh/guangyey/178/orig -> origin/gh/guangyey/178/orig 2025-12-04T08:56:59.1862441Z * [new branch] gh/guangyey/182/base -> origin/gh/guangyey/182/base 2025-12-04T08:56:59.1864165Z * [new branch] gh/guangyey/182/head -> origin/gh/guangyey/182/head 2025-12-04T08:56:59.1865760Z * [new branch] gh/guangyey/182/orig -> origin/gh/guangyey/182/orig 2025-12-04T08:56:59.1867768Z * [new branch] gh/guangyey/183/base -> origin/gh/guangyey/183/base 2025-12-04T08:56:59.1869400Z * [new branch] gh/guangyey/183/head -> origin/gh/guangyey/183/head 2025-12-04T08:56:59.1871018Z * [new branch] gh/guangyey/183/orig -> origin/gh/guangyey/183/orig 2025-12-04T08:56:59.1873106Z * [new branch] gh/guangyey/185/base -> origin/gh/guangyey/185/base 2025-12-04T08:56:59.1874704Z * [new branch] gh/guangyey/185/head -> origin/gh/guangyey/185/head 2025-12-04T08:56:59.1876238Z * [new branch] gh/guangyey/185/orig -> origin/gh/guangyey/185/orig 2025-12-04T08:56:59.1878319Z * [new branch] gh/guangyey/186/base -> origin/gh/guangyey/186/base 2025-12-04T08:56:59.1879952Z * [new branch] gh/guangyey/186/head -> origin/gh/guangyey/186/head 2025-12-04T08:56:59.1881422Z * [new branch] gh/guangyey/186/orig -> origin/gh/guangyey/186/orig 2025-12-04T08:56:59.1883525Z * [new branch] gh/guangyey/187/base -> origin/gh/guangyey/187/base 2025-12-04T08:56:59.1885209Z * [new branch] gh/guangyey/187/head -> origin/gh/guangyey/187/head 2025-12-04T08:56:59.1886642Z * [new branch] gh/guangyey/187/orig -> origin/gh/guangyey/187/orig 2025-12-04T08:56:59.1888736Z * [new branch] gh/guangyey/188/base -> origin/gh/guangyey/188/base 2025-12-04T08:56:59.1890342Z * [new branch] gh/guangyey/188/head -> origin/gh/guangyey/188/head 2025-12-04T08:56:59.1891890Z * [new branch] gh/guangyey/188/orig -> origin/gh/guangyey/188/orig 2025-12-04T08:56:59.1894078Z * [new branch] gh/guangyey/190/base -> origin/gh/guangyey/190/base 2025-12-04T08:56:59.1895556Z * [new branch] gh/guangyey/190/head -> origin/gh/guangyey/190/head 2025-12-04T08:56:59.1897103Z * [new branch] gh/guangyey/190/orig -> origin/gh/guangyey/190/orig 2025-12-04T08:56:59.1899224Z * [new branch] gh/guangyey/208/base -> origin/gh/guangyey/208/base 2025-12-04T08:56:59.1900767Z * [new branch] gh/guangyey/208/head -> origin/gh/guangyey/208/head 2025-12-04T08:56:59.1902388Z * [new branch] gh/guangyey/208/orig -> origin/gh/guangyey/208/orig 2025-12-04T08:56:59.1904498Z * [new branch] gh/guangyey/228/base -> origin/gh/guangyey/228/base 2025-12-04T08:56:59.1906027Z * [new branch] gh/guangyey/228/head -> origin/gh/guangyey/228/head 2025-12-04T08:56:59.1907646Z * [new branch] gh/guangyey/228/orig -> origin/gh/guangyey/228/orig 2025-12-04T08:56:59.1910185Z * [new branch] gh/guangyey/230/base -> origin/gh/guangyey/230/base 2025-12-04T08:56:59.1911784Z * [new branch] gh/guangyey/230/head -> origin/gh/guangyey/230/head 2025-12-04T08:56:59.1913291Z * [new branch] gh/guangyey/230/orig -> origin/gh/guangyey/230/orig 2025-12-04T08:56:59.1915463Z * [new branch] gh/guangyey/231/base -> origin/gh/guangyey/231/base 2025-12-04T08:56:59.1916991Z * [new branch] gh/guangyey/231/head -> origin/gh/guangyey/231/head 2025-12-04T08:56:59.1918733Z * [new branch] gh/guangyey/231/orig -> origin/gh/guangyey/231/orig 2025-12-04T08:56:59.1920934Z * [new branch] gh/guangyey/232/base -> origin/gh/guangyey/232/base 2025-12-04T08:56:59.1922505Z * [new branch] gh/guangyey/232/head -> origin/gh/guangyey/232/head 2025-12-04T08:56:59.1924029Z * [new branch] gh/guangyey/232/orig -> origin/gh/guangyey/232/orig 2025-12-04T08:56:59.1926140Z * [new branch] gh/guangyey/233/base -> origin/gh/guangyey/233/base 2025-12-04T08:56:59.1927801Z * [new branch] gh/guangyey/233/head -> origin/gh/guangyey/233/head 2025-12-04T08:56:59.1929316Z * [new branch] gh/guangyey/233/orig -> origin/gh/guangyey/233/orig 2025-12-04T08:56:59.1931433Z * [new branch] gh/guangyey/234/base -> origin/gh/guangyey/234/base 2025-12-04T08:56:59.1933212Z * [new branch] gh/guangyey/234/head -> origin/gh/guangyey/234/head 2025-12-04T08:56:59.1934727Z * [new branch] gh/guangyey/234/orig -> origin/gh/guangyey/234/orig 2025-12-04T08:56:59.1936834Z * [new branch] gh/guangyey/235/base -> origin/gh/guangyey/235/base 2025-12-04T08:56:59.1938372Z * [new branch] gh/guangyey/235/head -> origin/gh/guangyey/235/head 2025-12-04T08:56:59.1939933Z * [new branch] gh/guangyey/235/orig -> origin/gh/guangyey/235/orig 2025-12-04T08:56:59.1942083Z * [new branch] gh/guangyey/236/base -> origin/gh/guangyey/236/base 2025-12-04T08:56:59.1943619Z * [new branch] gh/guangyey/236/head -> origin/gh/guangyey/236/head 2025-12-04T08:56:59.1945203Z * [new branch] gh/guangyey/236/orig -> origin/gh/guangyey/236/orig 2025-12-04T08:56:59.1947939Z * [new branch] gh/guangyey/237/base -> origin/gh/guangyey/237/base 2025-12-04T08:56:59.1949406Z * [new branch] gh/guangyey/237/head -> origin/gh/guangyey/237/head 2025-12-04T08:56:59.1950938Z * [new branch] gh/guangyey/237/orig -> origin/gh/guangyey/237/orig 2025-12-04T08:56:59.1953141Z * [new branch] gh/guangyey/238/base -> origin/gh/guangyey/238/base 2025-12-04T08:56:59.1954705Z * [new branch] gh/guangyey/238/head -> origin/gh/guangyey/238/head 2025-12-04T08:56:59.1956841Z * [new branch] gh/guangyey/239/base -> origin/gh/guangyey/239/base 2025-12-04T08:56:59.1958418Z * [new branch] gh/guangyey/239/head -> origin/gh/guangyey/239/head 2025-12-04T08:56:59.1959974Z * [new branch] gh/guangyey/239/orig -> origin/gh/guangyey/239/orig 2025-12-04T08:56:59.1962313Z * [new branch] gh/guangyey/240/base -> origin/gh/guangyey/240/base 2025-12-04T08:56:59.1965564Z * [new branch] gh/guangyey/240/head -> origin/gh/guangyey/240/head 2025-12-04T08:56:59.1967114Z * [new branch] gh/guangyey/240/orig -> origin/gh/guangyey/240/orig 2025-12-04T08:56:59.1969259Z * [new branch] gh/guangyey/241/base -> origin/gh/guangyey/241/base 2025-12-04T08:56:59.1970809Z * [new branch] gh/guangyey/241/head -> origin/gh/guangyey/241/head 2025-12-04T08:56:59.1972460Z * [new branch] gh/guangyey/241/orig -> origin/gh/guangyey/241/orig 2025-12-04T08:56:59.1974683Z * [new branch] gh/guangyey/242/base -> origin/gh/guangyey/242/base 2025-12-04T08:56:59.1976202Z * [new branch] gh/guangyey/242/head -> origin/gh/guangyey/242/head 2025-12-04T08:56:59.1977815Z * [new branch] gh/guangyey/242/orig -> origin/gh/guangyey/242/orig 2025-12-04T08:56:59.1979952Z * [new branch] gh/guangyey/243/base -> origin/gh/guangyey/243/base 2025-12-04T08:56:59.1981461Z * [new branch] gh/guangyey/243/head -> origin/gh/guangyey/243/head 2025-12-04T08:56:59.1982956Z * [new branch] gh/guangyey/243/orig -> origin/gh/guangyey/243/orig 2025-12-04T08:56:59.1985231Z * [new branch] gh/guangyey/244/base -> origin/gh/guangyey/244/base 2025-12-04T08:56:59.1986795Z * [new branch] gh/guangyey/244/head -> origin/gh/guangyey/244/head 2025-12-04T08:56:59.1988406Z * [new branch] gh/guangyey/244/orig -> origin/gh/guangyey/244/orig 2025-12-04T08:56:59.1990572Z * [new branch] gh/guangyey/245/base -> origin/gh/guangyey/245/base 2025-12-04T08:56:59.1992089Z * [new branch] gh/guangyey/245/head -> origin/gh/guangyey/245/head 2025-12-04T08:56:59.1993608Z * [new branch] gh/guangyey/245/orig -> origin/gh/guangyey/245/orig 2025-12-04T08:56:59.1995774Z * [new branch] gh/guangyey/246/base -> origin/gh/guangyey/246/base 2025-12-04T08:56:59.1997353Z * [new branch] gh/guangyey/246/head -> origin/gh/guangyey/246/head 2025-12-04T08:56:59.1998898Z * [new branch] gh/guangyey/246/orig -> origin/gh/guangyey/246/orig 2025-12-04T08:56:59.2001074Z * [new branch] gh/guangyey/247/base -> origin/gh/guangyey/247/base 2025-12-04T08:56:59.2002708Z * [new branch] gh/guangyey/247/head -> origin/gh/guangyey/247/head 2025-12-04T08:56:59.2004264Z * [new branch] gh/guangyey/247/orig -> origin/gh/guangyey/247/orig 2025-12-04T08:56:59.2006408Z * [new branch] gh/guangyey/248/base -> origin/gh/guangyey/248/base 2025-12-04T08:56:59.2007921Z * [new branch] gh/guangyey/248/head -> origin/gh/guangyey/248/head 2025-12-04T08:56:59.2009518Z * [new branch] gh/guangyey/248/orig -> origin/gh/guangyey/248/orig 2025-12-04T08:56:59.2011792Z * [new branch] gh/guangyey/249/base -> origin/gh/guangyey/249/base 2025-12-04T08:56:59.2013331Z * [new branch] gh/guangyey/249/head -> origin/gh/guangyey/249/head 2025-12-04T08:56:59.2014954Z * [new branch] gh/guangyey/249/orig -> origin/gh/guangyey/249/orig 2025-12-04T08:56:59.2017080Z * [new branch] gh/guangyey/250/base -> origin/gh/guangyey/250/base 2025-12-04T08:56:59.2018637Z * [new branch] gh/guangyey/250/head -> origin/gh/guangyey/250/head 2025-12-04T08:56:59.2020263Z * [new branch] gh/guangyey/250/orig -> origin/gh/guangyey/250/orig 2025-12-04T08:56:59.2022416Z * [new branch] gh/guangyey/251/base -> origin/gh/guangyey/251/base 2025-12-04T08:56:59.2023922Z * [new branch] gh/guangyey/251/head -> origin/gh/guangyey/251/head 2025-12-04T08:56:59.2025471Z * [new branch] gh/guangyey/251/orig -> origin/gh/guangyey/251/orig 2025-12-04T08:56:59.2027640Z * [new branch] gh/guangyey/252/base -> origin/gh/guangyey/252/base 2025-12-04T08:56:59.2029172Z * [new branch] gh/guangyey/252/head -> origin/gh/guangyey/252/head 2025-12-04T08:56:59.2030716Z * [new branch] gh/guangyey/252/orig -> origin/gh/guangyey/252/orig 2025-12-04T08:56:59.2032914Z * [new branch] gh/guangyey/253/base -> origin/gh/guangyey/253/base 2025-12-04T08:56:59.2034419Z * [new branch] gh/guangyey/253/head -> origin/gh/guangyey/253/head 2025-12-04T08:56:59.2035957Z * [new branch] gh/guangyey/253/orig -> origin/gh/guangyey/253/orig 2025-12-04T08:56:59.2038049Z * [new branch] gh/guangyey/254/base -> origin/gh/guangyey/254/base 2025-12-04T08:56:59.2039570Z * [new branch] gh/guangyey/254/head -> origin/gh/guangyey/254/head 2025-12-04T08:56:59.2041303Z * [new branch] gh/guangyey/254/orig -> origin/gh/guangyey/254/orig 2025-12-04T08:56:59.2043382Z * [new branch] gh/guangyey/255/base -> origin/gh/guangyey/255/base 2025-12-04T08:56:59.2045496Z * [new branch] gh/guangyey/255/head -> origin/gh/guangyey/255/head 2025-12-04T08:56:59.2047070Z * [new branch] gh/guangyey/255/orig -> origin/gh/guangyey/255/orig 2025-12-04T08:56:59.2049761Z * [new branch] gh/guilhermeleobas/107/base -> origin/gh/guilhermeleobas/107/base 2025-12-04T08:56:59.2051302Z * [new branch] gh/guilhermeleobas/107/head -> origin/gh/guilhermeleobas/107/head 2025-12-04T08:56:59.2053089Z * [new branch] gh/guilhermeleobas/107/orig -> origin/gh/guilhermeleobas/107/orig 2025-12-04T08:56:59.2055186Z * [new branch] gh/guilhermeleobas/108/base -> origin/gh/guilhermeleobas/108/base 2025-12-04T08:56:59.2056843Z * [new branch] gh/guilhermeleobas/108/head -> origin/gh/guilhermeleobas/108/head 2025-12-04T08:56:59.2058435Z * [new branch] gh/guilhermeleobas/108/orig -> origin/gh/guilhermeleobas/108/orig 2025-12-04T08:56:59.2060639Z * [new branch] gh/guilhermeleobas/150/base -> origin/gh/guilhermeleobas/150/base 2025-12-04T08:56:59.2062973Z * [new branch] gh/guilhermeleobas/150/head -> origin/gh/guilhermeleobas/150/head 2025-12-04T08:56:59.2066595Z * [new branch] gh/guilhermeleobas/150/orig -> origin/gh/guilhermeleobas/150/orig 2025-12-04T08:56:59.2069498Z * [new branch] gh/guilhermeleobas/168/base -> origin/gh/guilhermeleobas/168/base 2025-12-04T08:56:59.2071603Z * [new branch] gh/guilhermeleobas/168/head -> origin/gh/guilhermeleobas/168/head 2025-12-04T08:56:59.2073895Z * [new branch] gh/guilhermeleobas/168/orig -> origin/gh/guilhermeleobas/168/orig 2025-12-04T08:56:59.2076825Z * [new branch] gh/guilhermeleobas/169/base -> origin/gh/guilhermeleobas/169/base 2025-12-04T08:56:59.2079839Z * [new branch] gh/guilhermeleobas/169/head -> origin/gh/guilhermeleobas/169/head 2025-12-04T08:56:59.2080455Z * [new branch] gh/guilhermeleobas/169/orig -> origin/gh/guilhermeleobas/169/orig 2025-12-04T08:56:59.2082779Z * [new branch] gh/guilhermeleobas/170/base -> origin/gh/guilhermeleobas/170/base 2025-12-04T08:56:59.2084382Z * [new branch] gh/guilhermeleobas/170/head -> origin/gh/guilhermeleobas/170/head 2025-12-04T08:56:59.2086137Z * [new branch] gh/guilhermeleobas/170/orig -> origin/gh/guilhermeleobas/170/orig 2025-12-04T08:56:59.2088103Z * [new branch] gh/guilhermeleobas/171/base -> origin/gh/guilhermeleobas/171/base 2025-12-04T08:56:59.2089789Z * [new branch] gh/guilhermeleobas/171/head -> origin/gh/guilhermeleobas/171/head 2025-12-04T08:56:59.2091345Z * [new branch] gh/guilhermeleobas/171/orig -> origin/gh/guilhermeleobas/171/orig 2025-12-04T08:56:59.2093520Z * [new branch] gh/guilhermeleobas/173/base -> origin/gh/guilhermeleobas/173/base 2025-12-04T08:56:59.2095132Z * [new branch] gh/guilhermeleobas/173/head -> origin/gh/guilhermeleobas/173/head 2025-12-04T08:56:59.2096673Z * [new branch] gh/guilhermeleobas/173/orig -> origin/gh/guilhermeleobas/173/orig 2025-12-04T08:56:59.2098636Z * [new branch] gh/guilhermeleobas/193/base -> origin/gh/guilhermeleobas/193/base 2025-12-04T08:56:59.2100296Z * [new branch] gh/guilhermeleobas/193/head -> origin/gh/guilhermeleobas/193/head 2025-12-04T08:56:59.2101943Z * [new branch] gh/guilhermeleobas/193/orig -> origin/gh/guilhermeleobas/193/orig 2025-12-04T08:56:59.2104124Z * [new branch] gh/guilhermeleobas/204/base -> origin/gh/guilhermeleobas/204/base 2025-12-04T08:56:59.2105685Z * [new branch] gh/guilhermeleobas/204/head -> origin/gh/guilhermeleobas/204/head 2025-12-04T08:56:59.2107257Z * [new branch] gh/guilhermeleobas/204/orig -> origin/gh/guilhermeleobas/204/orig 2025-12-04T08:56:59.2109320Z * [new branch] gh/guilhermeleobas/211/base -> origin/gh/guilhermeleobas/211/base 2025-12-04T08:56:59.2111141Z * [new branch] gh/guilhermeleobas/211/head -> origin/gh/guilhermeleobas/211/head 2025-12-04T08:56:59.2112716Z * [new branch] gh/guilhermeleobas/211/orig -> origin/gh/guilhermeleobas/211/orig 2025-12-04T08:56:59.2114743Z * [new branch] gh/guilhermeleobas/226/base -> origin/gh/guilhermeleobas/226/base 2025-12-04T08:56:59.2116331Z * [new branch] gh/guilhermeleobas/226/head -> origin/gh/guilhermeleobas/226/head 2025-12-04T08:56:59.2117870Z * [new branch] gh/guilhermeleobas/226/orig -> origin/gh/guilhermeleobas/226/orig 2025-12-04T08:56:59.2119960Z * [new branch] gh/guilhermeleobas/236/base -> origin/gh/guilhermeleobas/236/base 2025-12-04T08:56:59.2121497Z * [new branch] gh/guilhermeleobas/236/head -> origin/gh/guilhermeleobas/236/head 2025-12-04T08:56:59.2123024Z * [new branch] gh/guilhermeleobas/236/orig -> origin/gh/guilhermeleobas/236/orig 2025-12-04T08:56:59.2125083Z * [new branch] gh/guilhermeleobas/247/base -> origin/gh/guilhermeleobas/247/base 2025-12-04T08:56:59.2126628Z * [new branch] gh/guilhermeleobas/247/head -> origin/gh/guilhermeleobas/247/head 2025-12-04T08:56:59.2128188Z * [new branch] gh/guilhermeleobas/247/orig -> origin/gh/guilhermeleobas/247/orig 2025-12-04T08:56:59.2130240Z * [new branch] gh/guilhermeleobas/248/base -> origin/gh/guilhermeleobas/248/base 2025-12-04T08:56:59.2131887Z * [new branch] gh/guilhermeleobas/248/head -> origin/gh/guilhermeleobas/248/head 2025-12-04T08:56:59.2133567Z * [new branch] gh/guilhermeleobas/248/orig -> origin/gh/guilhermeleobas/248/orig 2025-12-04T08:56:59.2135656Z * [new branch] gh/guilhermeleobas/250/base -> origin/gh/guilhermeleobas/250/base 2025-12-04T08:56:59.2137355Z * [new branch] gh/guilhermeleobas/250/head -> origin/gh/guilhermeleobas/250/head 2025-12-04T08:56:59.2138804Z * [new branch] gh/guilhermeleobas/250/orig -> origin/gh/guilhermeleobas/250/orig 2025-12-04T08:56:59.2141332Z * [new branch] gh/guilhermeleobas/253/base -> origin/gh/guilhermeleobas/253/base 2025-12-04T08:56:59.2142921Z * [new branch] gh/guilhermeleobas/253/head -> origin/gh/guilhermeleobas/253/head 2025-12-04T08:56:59.2144544Z * [new branch] gh/guilhermeleobas/253/orig -> origin/gh/guilhermeleobas/253/orig 2025-12-04T08:56:59.2146642Z * [new branch] gh/guilhermeleobas/254/base -> origin/gh/guilhermeleobas/254/base 2025-12-04T08:56:59.2148230Z * [new branch] gh/guilhermeleobas/254/head -> origin/gh/guilhermeleobas/254/head 2025-12-04T08:56:59.2149812Z * [new branch] gh/guilhermeleobas/254/orig -> origin/gh/guilhermeleobas/254/orig 2025-12-04T08:56:59.2151879Z * [new branch] gh/guilhermeleobas/255/base -> origin/gh/guilhermeleobas/255/base 2025-12-04T08:56:59.2153526Z * [new branch] gh/guilhermeleobas/255/head -> origin/gh/guilhermeleobas/255/head 2025-12-04T08:56:59.2155099Z * [new branch] gh/guilhermeleobas/255/orig -> origin/gh/guilhermeleobas/255/orig 2025-12-04T08:56:59.2157251Z * [new branch] gh/guilhermeleobas/256/base -> origin/gh/guilhermeleobas/256/base 2025-12-04T08:56:59.2158968Z * [new branch] gh/guilhermeleobas/256/head -> origin/gh/guilhermeleobas/256/head 2025-12-04T08:56:59.2160646Z * [new branch] gh/guilhermeleobas/256/orig -> origin/gh/guilhermeleobas/256/orig 2025-12-04T08:56:59.2162775Z * [new branch] gh/guilhermeleobas/257/base -> origin/gh/guilhermeleobas/257/base 2025-12-04T08:56:59.2164742Z * [new branch] gh/guilhermeleobas/257/head -> origin/gh/guilhermeleobas/257/head 2025-12-04T08:56:59.2166269Z * [new branch] gh/guilhermeleobas/257/orig -> origin/gh/guilhermeleobas/257/orig 2025-12-04T08:56:59.2168386Z * [new branch] gh/guilhermeleobas/258/base -> origin/gh/guilhermeleobas/258/base 2025-12-04T08:56:59.2170192Z * [new branch] gh/guilhermeleobas/258/head -> origin/gh/guilhermeleobas/258/head 2025-12-04T08:56:59.2171348Z * [new branch] gh/guilhermeleobas/258/orig -> origin/gh/guilhermeleobas/258/orig 2025-12-04T08:56:59.2173695Z * [new branch] gh/guilhermeleobas/259/base -> origin/gh/guilhermeleobas/259/base 2025-12-04T08:56:59.2175323Z * [new branch] gh/guilhermeleobas/259/head -> origin/gh/guilhermeleobas/259/head 2025-12-04T08:56:59.2177339Z * [new branch] gh/guilhermeleobas/259/orig -> origin/gh/guilhermeleobas/259/orig 2025-12-04T08:56:59.2179492Z * [new branch] gh/guilhermeleobas/260/base -> origin/gh/guilhermeleobas/260/base 2025-12-04T08:56:59.2181139Z * [new branch] gh/guilhermeleobas/260/head -> origin/gh/guilhermeleobas/260/head 2025-12-04T08:56:59.2182645Z * [new branch] gh/guilhermeleobas/260/orig -> origin/gh/guilhermeleobas/260/orig 2025-12-04T08:56:59.2184757Z * [new branch] gh/guilhermeleobas/261/base -> origin/gh/guilhermeleobas/261/base 2025-12-04T08:56:59.2186515Z * [new branch] gh/guilhermeleobas/261/head -> origin/gh/guilhermeleobas/261/head 2025-12-04T08:56:59.2189488Z * [new branch] gh/guilhermeleobas/261/orig -> origin/gh/guilhermeleobas/261/orig 2025-12-04T08:56:59.2190632Z * [new branch] gh/guilhermeleobas/262/base -> origin/gh/guilhermeleobas/262/base 2025-12-04T08:56:59.2191898Z * [new branch] gh/guilhermeleobas/262/head -> origin/gh/guilhermeleobas/262/head 2025-12-04T08:56:59.2193521Z * [new branch] gh/guilhermeleobas/262/orig -> origin/gh/guilhermeleobas/262/orig 2025-12-04T08:56:59.2195622Z * [new branch] gh/guilhermeleobas/263/base -> origin/gh/guilhermeleobas/263/base 2025-12-04T08:56:59.2197926Z * [new branch] gh/guilhermeleobas/263/head -> origin/gh/guilhermeleobas/263/head 2025-12-04T08:56:59.2199459Z * [new branch] gh/guilhermeleobas/263/orig -> origin/gh/guilhermeleobas/263/orig 2025-12-04T08:56:59.2201495Z * [new branch] gh/guilhermeleobas/264/base -> origin/gh/guilhermeleobas/264/base 2025-12-04T08:56:59.2203046Z * [new branch] gh/guilhermeleobas/264/head -> origin/gh/guilhermeleobas/264/head 2025-12-04T08:56:59.2204584Z * [new branch] gh/guilhermeleobas/264/orig -> origin/gh/guilhermeleobas/264/orig 2025-12-04T08:56:59.2207126Z * [new branch] gh/guilhermeleobas/265/base -> origin/gh/guilhermeleobas/265/base 2025-12-04T08:56:59.2208228Z * [new branch] gh/guilhermeleobas/265/head -> origin/gh/guilhermeleobas/265/head 2025-12-04T08:56:59.2209894Z * [new branch] gh/guilhermeleobas/265/orig -> origin/gh/guilhermeleobas/265/orig 2025-12-04T08:56:59.2212140Z * [new branch] gh/guilhermeleobas/266/base -> origin/gh/guilhermeleobas/266/base 2025-12-04T08:56:59.2213897Z * [new branch] gh/guilhermeleobas/266/head -> origin/gh/guilhermeleobas/266/head 2025-12-04T08:56:59.2215541Z * [new branch] gh/guilhermeleobas/266/orig -> origin/gh/guilhermeleobas/266/orig 2025-12-04T08:56:59.2217687Z * [new branch] gh/guilhermeleobas/267/base -> origin/gh/guilhermeleobas/267/base 2025-12-04T08:56:59.2219323Z * [new branch] gh/guilhermeleobas/267/head -> origin/gh/guilhermeleobas/267/head 2025-12-04T08:56:59.2220879Z * [new branch] gh/guilhermeleobas/267/orig -> origin/gh/guilhermeleobas/267/orig 2025-12-04T08:56:59.2223520Z * [new branch] gh/hameerabbasi/1/base -> origin/gh/hameerabbasi/1/base 2025-12-04T08:56:59.2225458Z * [new branch] gh/hameerabbasi/1/head -> origin/gh/hameerabbasi/1/head 2025-12-04T08:56:59.2227502Z * [new branch] gh/hameerabbasi/2/base -> origin/gh/hameerabbasi/2/base 2025-12-04T08:56:59.2229083Z * [new branch] gh/hameerabbasi/2/head -> origin/gh/hameerabbasi/2/head 2025-12-04T08:56:59.2230662Z * [new branch] gh/hameerabbasi/2/orig -> origin/gh/hameerabbasi/2/orig 2025-12-04T08:56:59.2232627Z * [new branch] gh/hameerabbasi/3/base -> origin/gh/hameerabbasi/3/base 2025-12-04T08:56:59.2234224Z * [new branch] gh/hameerabbasi/3/head -> origin/gh/hameerabbasi/3/head 2025-12-04T08:56:59.2235873Z * [new branch] gh/hameerabbasi/3/orig -> origin/gh/hameerabbasi/3/orig 2025-12-04T08:56:59.2237848Z * [new branch] gh/hameerabbasi/4/base -> origin/gh/hameerabbasi/4/base 2025-12-04T08:56:59.2239512Z * [new branch] gh/hameerabbasi/4/head -> origin/gh/hameerabbasi/4/head 2025-12-04T08:56:59.2241015Z * [new branch] gh/hameerabbasi/4/orig -> origin/gh/hameerabbasi/4/orig 2025-12-04T08:56:59.2243532Z * [new branch] gh/huydhn/1/next -> origin/gh/huydhn/1/next 2025-12-04T08:56:59.2245431Z * [new branch] gh/huydhn/2/next -> origin/gh/huydhn/2/next 2025-12-04T08:56:59.2247492Z * [new branch] gh/huydhn/3/next -> origin/gh/huydhn/3/next 2025-12-04T08:56:59.2249646Z * [new branch] gh/huydhn/4/next -> origin/gh/huydhn/4/next 2025-12-04T08:56:59.2251625Z * [new branch] gh/huydhn/5/next -> origin/gh/huydhn/5/next 2025-12-04T08:56:59.2253853Z * [new branch] gh/huydhn/6/next -> origin/gh/huydhn/6/next 2025-12-04T08:56:59.2256430Z * [new branch] gh/int3/97/base -> origin/gh/int3/97/base 2025-12-04T08:56:59.2257892Z * [new branch] gh/int3/97/head -> origin/gh/int3/97/head 2025-12-04T08:56:59.2260462Z * [new branch] gh/isuruf/101/base -> origin/gh/isuruf/101/base 2025-12-04T08:56:59.2262408Z * [new branch] gh/isuruf/101/head -> origin/gh/isuruf/101/head 2025-12-04T08:56:59.2264340Z * [new branch] gh/isuruf/146/base -> origin/gh/isuruf/146/base 2025-12-04T08:56:59.2265836Z * [new branch] gh/isuruf/146/head -> origin/gh/isuruf/146/head 2025-12-04T08:56:59.2267438Z * [new branch] gh/isuruf/146/orig -> origin/gh/isuruf/146/orig 2025-12-04T08:56:59.2269517Z * [new branch] gh/isuruf/158/base -> origin/gh/isuruf/158/base 2025-12-04T08:56:59.2271064Z * [new branch] gh/isuruf/158/head -> origin/gh/isuruf/158/head 2025-12-04T08:56:59.2273084Z * [new branch] gh/isuruf/159/base -> origin/gh/isuruf/159/base 2025-12-04T08:56:59.2274781Z * [new branch] gh/isuruf/159/head -> origin/gh/isuruf/159/head 2025-12-04T08:56:59.2276850Z * [new branch] gh/isuruf/160/base -> origin/gh/isuruf/160/base 2025-12-04T08:56:59.2278413Z * [new branch] gh/isuruf/160/head -> origin/gh/isuruf/160/head 2025-12-04T08:56:59.2279934Z * [new branch] gh/isuruf/160/orig -> origin/gh/isuruf/160/orig 2025-12-04T08:56:59.2282083Z * [new branch] gh/isuruf/81/base -> origin/gh/isuruf/81/base 2025-12-04T08:56:59.2283611Z * [new branch] gh/isuruf/81/head -> origin/gh/isuruf/81/head 2025-12-04T08:56:59.2285127Z * [new branch] gh/isuruf/81/orig -> origin/gh/isuruf/81/orig 2025-12-04T08:56:59.2287660Z * [new branch] gh/jamesjwu/176/base -> origin/gh/jamesjwu/176/base 2025-12-04T08:56:59.2289198Z * [new branch] gh/jamesjwu/176/head -> origin/gh/jamesjwu/176/head 2025-12-04T08:56:59.2290759Z * [new branch] gh/jamesjwu/176/orig -> origin/gh/jamesjwu/176/orig 2025-12-04T08:56:59.2293008Z * [new branch] gh/jamesjwu/187/base -> origin/gh/jamesjwu/187/base 2025-12-04T08:56:59.2294596Z * [new branch] gh/jamesjwu/187/head -> origin/gh/jamesjwu/187/head 2025-12-04T08:56:59.2296119Z * [new branch] gh/jamesjwu/187/orig -> origin/gh/jamesjwu/187/orig 2025-12-04T08:56:59.2298412Z * [new branch] gh/jamesjwu/196/base -> origin/gh/jamesjwu/196/base 2025-12-04T08:56:59.2300024Z * [new branch] gh/jamesjwu/196/head -> origin/gh/jamesjwu/196/head 2025-12-04T08:56:59.2301511Z * [new branch] gh/jamesjwu/196/orig -> origin/gh/jamesjwu/196/orig 2025-12-04T08:56:59.2303522Z * [new branch] gh/jamesjwu/198/base -> origin/gh/jamesjwu/198/base 2025-12-04T08:56:59.2305052Z * [new branch] gh/jamesjwu/198/head -> origin/gh/jamesjwu/198/head 2025-12-04T08:56:59.2306597Z * [new branch] gh/jamesjwu/198/orig -> origin/gh/jamesjwu/198/orig 2025-12-04T08:56:59.2309169Z * [new branch] gh/jamesjwu/207/base -> origin/gh/jamesjwu/207/base 2025-12-04T08:56:59.2310883Z * [new branch] gh/jamesjwu/207/head -> origin/gh/jamesjwu/207/head 2025-12-04T08:56:59.2312844Z * [new branch] gh/jamesjwu/207/orig -> origin/gh/jamesjwu/207/orig 2025-12-04T08:56:59.2315084Z * [new branch] gh/jamesjwu/208/base -> origin/gh/jamesjwu/208/base 2025-12-04T08:56:59.2316725Z * [new branch] gh/jamesjwu/208/head -> origin/gh/jamesjwu/208/head 2025-12-04T08:56:59.2318305Z * [new branch] gh/jamesjwu/208/orig -> origin/gh/jamesjwu/208/orig 2025-12-04T08:56:59.2320510Z * [new branch] gh/jamesjwu/52/base -> origin/gh/jamesjwu/52/base 2025-12-04T08:56:59.2322019Z * [new branch] gh/jamesjwu/52/head -> origin/gh/jamesjwu/52/head 2025-12-04T08:56:59.2323921Z * [new branch] gh/jamesjwu/53/base -> origin/gh/jamesjwu/53/base 2025-12-04T08:56:59.2325476Z * [new branch] gh/jamesjwu/53/head -> origin/gh/jamesjwu/53/head 2025-12-04T08:56:59.2327554Z * [new branch] gh/jamesjwu/54/base -> origin/gh/jamesjwu/54/base 2025-12-04T08:56:59.2328947Z * [new branch] gh/jamesjwu/54/head -> origin/gh/jamesjwu/54/head 2025-12-04T08:56:59.2331081Z * [new branch] gh/jamesjwu/55/base -> origin/gh/jamesjwu/55/base 2025-12-04T08:56:59.2332659Z * [new branch] gh/jamesjwu/55/head -> origin/gh/jamesjwu/55/head 2025-12-04T08:56:59.2334758Z * [new branch] gh/jamesjwu/56/base -> origin/gh/jamesjwu/56/base 2025-12-04T08:56:59.2336320Z * [new branch] gh/jamesjwu/56/head -> origin/gh/jamesjwu/56/head 2025-12-04T08:56:59.2338309Z * [new branch] gh/jamesjwu/57/base -> origin/gh/jamesjwu/57/base 2025-12-04T08:56:59.2339792Z * [new branch] gh/jamesjwu/57/head -> origin/gh/jamesjwu/57/head 2025-12-04T08:56:59.2341777Z * [new branch] gh/jamesjwu/58/base -> origin/gh/jamesjwu/58/base 2025-12-04T08:56:59.2343259Z * [new branch] gh/jamesjwu/58/head -> origin/gh/jamesjwu/58/head 2025-12-04T08:56:59.2345245Z * [new branch] gh/jamesjwu/59/base -> origin/gh/jamesjwu/59/base 2025-12-04T08:56:59.2346770Z * [new branch] gh/jamesjwu/59/head -> origin/gh/jamesjwu/59/head 2025-12-04T08:56:59.2348709Z * [new branch] gh/jamesjwu/60/base -> origin/gh/jamesjwu/60/base 2025-12-04T08:56:59.2350278Z * [new branch] gh/jamesjwu/60/head -> origin/gh/jamesjwu/60/head 2025-12-04T08:56:59.2352235Z * [new branch] gh/jamesjwu/61/base -> origin/gh/jamesjwu/61/base 2025-12-04T08:56:59.2353787Z * [new branch] gh/jamesjwu/61/head -> origin/gh/jamesjwu/61/head 2025-12-04T08:56:59.2355784Z * [new branch] gh/jamesjwu/62/base -> origin/gh/jamesjwu/62/base 2025-12-04T08:56:59.2357289Z * [new branch] gh/jamesjwu/62/head -> origin/gh/jamesjwu/62/head 2025-12-04T08:56:59.2359240Z * [new branch] gh/jamesjwu/63/base -> origin/gh/jamesjwu/63/base 2025-12-04T08:56:59.2360809Z * [new branch] gh/jamesjwu/63/head -> origin/gh/jamesjwu/63/head 2025-12-04T08:56:59.2365336Z * [new branch] gh/jamesjwu/64/base -> origin/gh/jamesjwu/64/base 2025-12-04T08:56:59.2366899Z * [new branch] gh/jamesjwu/64/head -> origin/gh/jamesjwu/64/head 2025-12-04T08:56:59.2368956Z * [new branch] gh/jamesjwu/65/base -> origin/gh/jamesjwu/65/base 2025-12-04T08:56:59.2370440Z * [new branch] gh/jamesjwu/65/head -> origin/gh/jamesjwu/65/head 2025-12-04T08:56:59.2373121Z * [new branch] gh/janeyx99/165/base -> origin/gh/janeyx99/165/base 2025-12-04T08:56:59.2374693Z * [new branch] gh/janeyx99/165/head -> origin/gh/janeyx99/165/head 2025-12-04T08:56:59.2376370Z * [new branch] gh/janeyx99/165/orig -> origin/gh/janeyx99/165/orig 2025-12-04T08:56:59.2378301Z * [new branch] gh/janeyx99/201/base -> origin/gh/janeyx99/201/base 2025-12-04T08:56:59.2379851Z * [new branch] gh/janeyx99/201/head -> origin/gh/janeyx99/201/head 2025-12-04T08:56:59.2381377Z * [new branch] gh/janeyx99/201/orig -> origin/gh/janeyx99/201/orig 2025-12-04T08:56:59.2383683Z * [new branch] gh/janeyx99/225/base -> origin/gh/janeyx99/225/base 2025-12-04T08:56:59.2385223Z * [new branch] gh/janeyx99/225/head -> origin/gh/janeyx99/225/head 2025-12-04T08:56:59.2386801Z * [new branch] gh/janeyx99/225/orig -> origin/gh/janeyx99/225/orig 2025-12-04T08:56:59.2389005Z * [new branch] gh/janeyx99/299/base -> origin/gh/janeyx99/299/base 2025-12-04T08:56:59.2390557Z * [new branch] gh/janeyx99/299/head -> origin/gh/janeyx99/299/head 2025-12-04T08:56:59.2392244Z * [new branch] gh/janeyx99/299/orig -> origin/gh/janeyx99/299/orig 2025-12-04T08:56:59.2394502Z * [new branch] gh/janeyx99/302/base -> origin/gh/janeyx99/302/base 2025-12-04T08:56:59.2396016Z * [new branch] gh/janeyx99/302/head -> origin/gh/janeyx99/302/head 2025-12-04T08:56:59.2398047Z * [new branch] gh/janeyx99/303/base -> origin/gh/janeyx99/303/base 2025-12-04T08:56:59.2399526Z * [new branch] gh/janeyx99/303/head -> origin/gh/janeyx99/303/head 2025-12-04T08:56:59.2401649Z * [new branch] gh/janeyx99/305/base -> origin/gh/janeyx99/305/base 2025-12-04T08:56:59.2403266Z * [new branch] gh/janeyx99/305/head -> origin/gh/janeyx99/305/head 2025-12-04T08:56:59.2405187Z * [new branch] gh/janeyx99/306/base -> origin/gh/janeyx99/306/base 2025-12-04T08:56:59.2406683Z * [new branch] gh/janeyx99/306/head -> origin/gh/janeyx99/306/head 2025-12-04T08:56:59.2408771Z * [new branch] gh/janeyx99/314/base -> origin/gh/janeyx99/314/base 2025-12-04T08:56:59.2410369Z * [new branch] gh/janeyx99/314/head -> origin/gh/janeyx99/314/head 2025-12-04T08:56:59.2411906Z * [new branch] gh/janeyx99/314/orig -> origin/gh/janeyx99/314/orig 2025-12-04T08:56:59.2414176Z * [new branch] gh/janeyx99/315/base -> origin/gh/janeyx99/315/base 2025-12-04T08:56:59.2415704Z * [new branch] gh/janeyx99/315/head -> origin/gh/janeyx99/315/head 2025-12-04T08:56:59.2417303Z * [new branch] gh/janeyx99/315/orig -> origin/gh/janeyx99/315/orig 2025-12-04T08:56:59.2419433Z * [new branch] gh/janeyx99/316/base -> origin/gh/janeyx99/316/base 2025-12-04T08:56:59.2420987Z * [new branch] gh/janeyx99/316/head -> origin/gh/janeyx99/316/head 2025-12-04T08:56:59.2422519Z * [new branch] gh/janeyx99/316/orig -> origin/gh/janeyx99/316/orig 2025-12-04T08:56:59.2424719Z * [new branch] gh/janeyx99/317/base -> origin/gh/janeyx99/317/base 2025-12-04T08:56:59.2426249Z * [new branch] gh/janeyx99/317/head -> origin/gh/janeyx99/317/head 2025-12-04T08:56:59.2427794Z * [new branch] gh/janeyx99/317/orig -> origin/gh/janeyx99/317/orig 2025-12-04T08:56:59.2430441Z * [new branch] gh/janeyx99/325/base -> origin/gh/janeyx99/325/base 2025-12-04T08:56:59.2431985Z * [new branch] gh/janeyx99/325/head -> origin/gh/janeyx99/325/head 2025-12-04T08:56:59.2433514Z * [new branch] gh/janeyx99/325/orig -> origin/gh/janeyx99/325/orig 2025-12-04T08:56:59.2435602Z * [new branch] gh/janeyx99/327/base -> origin/gh/janeyx99/327/base 2025-12-04T08:56:59.2437148Z * [new branch] gh/janeyx99/327/head -> origin/gh/janeyx99/327/head 2025-12-04T08:56:59.2438758Z * [new branch] gh/janeyx99/327/orig -> origin/gh/janeyx99/327/orig 2025-12-04T08:56:59.2440890Z * [new branch] gh/janeyx99/328/base -> origin/gh/janeyx99/328/base 2025-12-04T08:56:59.2442490Z * [new branch] gh/janeyx99/328/head -> origin/gh/janeyx99/328/head 2025-12-04T08:56:59.2444149Z * [new branch] gh/janeyx99/328/orig -> origin/gh/janeyx99/328/orig 2025-12-04T08:56:59.2446116Z * [new branch] gh/janeyx99/329/base -> origin/gh/janeyx99/329/base 2025-12-04T08:56:59.2447655Z * [new branch] gh/janeyx99/329/head -> origin/gh/janeyx99/329/head 2025-12-04T08:56:59.2449232Z * [new branch] gh/janeyx99/329/orig -> origin/gh/janeyx99/329/orig 2025-12-04T08:56:59.2451798Z * [new branch] gh/janeyx99/330/base -> origin/gh/janeyx99/330/base 2025-12-04T08:56:59.2453502Z * [new branch] gh/janeyx99/330/head -> origin/gh/janeyx99/330/head 2025-12-04T08:56:59.2455089Z * [new branch] gh/janeyx99/330/orig -> origin/gh/janeyx99/330/orig 2025-12-04T08:56:59.2457204Z * [new branch] gh/janeyx99/331/base -> origin/gh/janeyx99/331/base 2025-12-04T08:56:59.2458672Z * [new branch] gh/janeyx99/331/head -> origin/gh/janeyx99/331/head 2025-12-04T08:56:59.2460251Z * [new branch] gh/janeyx99/331/orig -> origin/gh/janeyx99/331/orig 2025-12-04T08:56:59.2462621Z * [new branch] gh/janeyx99/332/base -> origin/gh/janeyx99/332/base 2025-12-04T08:56:59.2464798Z * [new branch] gh/janeyx99/332/head -> origin/gh/janeyx99/332/head 2025-12-04T08:56:59.2467100Z * [new branch] gh/janeyx99/332/orig -> origin/gh/janeyx99/332/orig 2025-12-04T08:56:59.2469977Z * [new branch] gh/janeyx99/333/base -> origin/gh/janeyx99/333/base 2025-12-04T08:56:59.2472183Z * [new branch] gh/janeyx99/333/head -> origin/gh/janeyx99/333/head 2025-12-04T08:56:59.2474444Z * [new branch] gh/janeyx99/333/orig -> origin/gh/janeyx99/333/orig 2025-12-04T08:56:59.2477351Z * [new branch] gh/janeyx99/88/base -> origin/gh/janeyx99/88/base 2025-12-04T08:56:59.2478780Z * [new branch] gh/janeyx99/88/head -> origin/gh/janeyx99/88/head 2025-12-04T08:56:59.2480329Z * [new branch] gh/janeyx99/88/orig -> origin/gh/janeyx99/88/orig 2025-12-04T08:56:59.2482894Z * [new branch] gh/jansel/360/base -> origin/gh/jansel/360/base 2025-12-04T08:56:59.2484572Z * [new branch] gh/jansel/360/head -> origin/gh/jansel/360/head 2025-12-04T08:56:59.2486656Z * [new branch] gh/jansel/451/base -> origin/gh/jansel/451/base 2025-12-04T08:56:59.2488222Z * [new branch] gh/jansel/451/head -> origin/gh/jansel/451/head 2025-12-04T08:56:59.2489976Z * [new branch] gh/jansel/451/orig -> origin/gh/jansel/451/orig 2025-12-04T08:56:59.2492029Z * [new branch] gh/jansel/462/base -> origin/gh/jansel/462/base 2025-12-04T08:56:59.2493768Z * [new branch] gh/jansel/462/head -> origin/gh/jansel/462/head 2025-12-04T08:56:59.2495291Z * [new branch] gh/jansel/462/orig -> origin/gh/jansel/462/orig 2025-12-04T08:56:59.2497380Z * [new branch] gh/jansel/533/base -> origin/gh/jansel/533/base 2025-12-04T08:56:59.2498924Z * [new branch] gh/jansel/533/head -> origin/gh/jansel/533/head 2025-12-04T08:56:59.2500497Z * [new branch] gh/jansel/533/orig -> origin/gh/jansel/533/orig 2025-12-04T08:56:59.2502587Z * [new branch] gh/jansel/552/base -> origin/gh/jansel/552/base 2025-12-04T08:56:59.2504144Z * [new branch] gh/jansel/552/head -> origin/gh/jansel/552/head 2025-12-04T08:56:59.2505768Z * [new branch] gh/jansel/552/orig -> origin/gh/jansel/552/orig 2025-12-04T08:56:59.2507871Z * [new branch] gh/jansel/553/base -> origin/gh/jansel/553/base 2025-12-04T08:56:59.2509399Z * [new branch] gh/jansel/553/head -> origin/gh/jansel/553/head 2025-12-04T08:56:59.2510924Z * [new branch] gh/jansel/553/orig -> origin/gh/jansel/553/orig 2025-12-04T08:56:59.2512991Z * [new branch] gh/jansel/554/base -> origin/gh/jansel/554/base 2025-12-04T08:56:59.2514527Z * [new branch] gh/jansel/554/head -> origin/gh/jansel/554/head 2025-12-04T08:56:59.2516028Z * [new branch] gh/jansel/554/orig -> origin/gh/jansel/554/orig 2025-12-04T08:56:59.2518057Z * [new branch] gh/jansel/555/base -> origin/gh/jansel/555/base 2025-12-04T08:56:59.2519580Z * [new branch] gh/jansel/555/head -> origin/gh/jansel/555/head 2025-12-04T08:56:59.2521271Z * [new branch] gh/jansel/555/orig -> origin/gh/jansel/555/orig 2025-12-04T08:56:59.2523315Z * [new branch] gh/jansel/556/base -> origin/gh/jansel/556/base 2025-12-04T08:56:59.2524811Z * [new branch] gh/jansel/556/head -> origin/gh/jansel/556/head 2025-12-04T08:56:59.2526434Z * [new branch] gh/jansel/556/orig -> origin/gh/jansel/556/orig 2025-12-04T08:56:59.2528533Z * [new branch] gh/jansel/557/base -> origin/gh/jansel/557/base 2025-12-04T08:56:59.2530026Z * [new branch] gh/jansel/557/head -> origin/gh/jansel/557/head 2025-12-04T08:56:59.2531566Z * [new branch] gh/jansel/557/orig -> origin/gh/jansel/557/orig 2025-12-04T08:56:59.2533914Z * [new branch] gh/jansel/558/base -> origin/gh/jansel/558/base 2025-12-04T08:56:59.2535813Z * [new branch] gh/jansel/558/head -> origin/gh/jansel/558/head 2025-12-04T08:56:59.2537327Z * [new branch] gh/jansel/558/orig -> origin/gh/jansel/558/orig 2025-12-04T08:56:59.2539405Z * [new branch] gh/jansel/559/base -> origin/gh/jansel/559/base 2025-12-04T08:56:59.2540948Z * [new branch] gh/jansel/559/head -> origin/gh/jansel/559/head 2025-12-04T08:56:59.2542482Z * [new branch] gh/jansel/559/orig -> origin/gh/jansel/559/orig 2025-12-04T08:56:59.2544549Z * [new branch] gh/jansel/560/base -> origin/gh/jansel/560/base 2025-12-04T08:56:59.2546140Z * [new branch] gh/jansel/560/head -> origin/gh/jansel/560/head 2025-12-04T08:56:59.2548188Z * [new branch] gh/jansel/560/orig -> origin/gh/jansel/560/orig 2025-12-04T08:56:59.2550331Z * [new branch] gh/jansel/561/base -> origin/gh/jansel/561/base 2025-12-04T08:56:59.2551850Z * [new branch] gh/jansel/561/head -> origin/gh/jansel/561/head 2025-12-04T08:56:59.2553385Z * [new branch] gh/jansel/561/orig -> origin/gh/jansel/561/orig 2025-12-04T08:56:59.2555613Z * [new branch] gh/jansel/562/base -> origin/gh/jansel/562/base 2025-12-04T08:56:59.2557483Z * [new branch] gh/jansel/562/head -> origin/gh/jansel/562/head 2025-12-04T08:56:59.2559094Z * [new branch] gh/jansel/562/orig -> origin/gh/jansel/562/orig 2025-12-04T08:56:59.2561128Z * [new branch] gh/jansel/563/base -> origin/gh/jansel/563/base 2025-12-04T08:56:59.2562978Z * [new branch] gh/jansel/563/head -> origin/gh/jansel/563/head 2025-12-04T08:56:59.2564555Z * [new branch] gh/jansel/563/orig -> origin/gh/jansel/563/orig 2025-12-04T08:56:59.2567190Z * [new branch] gh/jansel/564/base -> origin/gh/jansel/564/base 2025-12-04T08:56:59.2568754Z * [new branch] gh/jansel/564/head -> origin/gh/jansel/564/head 2025-12-04T08:56:59.2570380Z * [new branch] gh/jansel/564/orig -> origin/gh/jansel/564/orig 2025-12-04T08:56:59.2572565Z * [new branch] gh/jansel/565/base -> origin/gh/jansel/565/base 2025-12-04T08:56:59.2574138Z * [new branch] gh/jansel/565/head -> origin/gh/jansel/565/head 2025-12-04T08:56:59.2575694Z * [new branch] gh/jansel/565/orig -> origin/gh/jansel/565/orig 2025-12-04T08:56:59.2577903Z * [new branch] gh/jansel/566/base -> origin/gh/jansel/566/base 2025-12-04T08:56:59.2579419Z * [new branch] gh/jansel/566/head -> origin/gh/jansel/566/head 2025-12-04T08:56:59.2580961Z * [new branch] gh/jansel/566/orig -> origin/gh/jansel/566/orig 2025-12-04T08:56:59.2583688Z * [new branch] gh/jansel/567/base -> origin/gh/jansel/567/base 2025-12-04T08:56:59.2585298Z * [new branch] gh/jansel/567/head -> origin/gh/jansel/567/head 2025-12-04T08:56:59.2587009Z * [new branch] gh/jansel/567/orig -> origin/gh/jansel/567/orig 2025-12-04T08:56:59.2589093Z * [new branch] gh/jansel/568/base -> origin/gh/jansel/568/base 2025-12-04T08:56:59.2590632Z * [new branch] gh/jansel/568/head -> origin/gh/jansel/568/head 2025-12-04T08:56:59.2592256Z * [new branch] gh/jansel/568/orig -> origin/gh/jansel/568/orig 2025-12-04T08:56:59.2594264Z * [new branch] gh/jansel/569/base -> origin/gh/jansel/569/base 2025-12-04T08:56:59.2595812Z * [new branch] gh/jansel/569/head -> origin/gh/jansel/569/head 2025-12-04T08:56:59.2597369Z * [new branch] gh/jansel/569/orig -> origin/gh/jansel/569/orig 2025-12-04T08:56:59.2599486Z * [new branch] gh/jansel/570/base -> origin/gh/jansel/570/base 2025-12-04T08:56:59.2601015Z * [new branch] gh/jansel/570/head -> origin/gh/jansel/570/head 2025-12-04T08:56:59.2602592Z * [new branch] gh/jansel/570/orig -> origin/gh/jansel/570/orig 2025-12-04T08:56:59.2604795Z * [new branch] gh/jansel/571/base -> origin/gh/jansel/571/base 2025-12-04T08:56:59.2606309Z * [new branch] gh/jansel/571/head -> origin/gh/jansel/571/head 2025-12-04T08:56:59.2607815Z * [new branch] gh/jansel/571/orig -> origin/gh/jansel/571/orig 2025-12-04T08:56:59.2609897Z * [new branch] gh/jansel/572/base -> origin/gh/jansel/572/base 2025-12-04T08:56:59.2611475Z * [new branch] gh/jansel/572/head -> origin/gh/jansel/572/head 2025-12-04T08:56:59.2613239Z * [new branch] gh/jansel/572/orig -> origin/gh/jansel/572/orig 2025-12-04T08:56:59.2615402Z * [new branch] gh/jansel/573/base -> origin/gh/jansel/573/base 2025-12-04T08:56:59.2617015Z * [new branch] gh/jansel/573/head -> origin/gh/jansel/573/head 2025-12-04T08:56:59.2618620Z * [new branch] gh/jansel/573/orig -> origin/gh/jansel/573/orig 2025-12-04T08:56:59.2620731Z * [new branch] gh/jansel/574/base -> origin/gh/jansel/574/base 2025-12-04T08:56:59.2622223Z * [new branch] gh/jansel/574/head -> origin/gh/jansel/574/head 2025-12-04T08:56:59.2623774Z * [new branch] gh/jansel/574/orig -> origin/gh/jansel/574/orig 2025-12-04T08:56:59.2626071Z * [new branch] gh/jansel/575/base -> origin/gh/jansel/575/base 2025-12-04T08:56:59.2627583Z * [new branch] gh/jansel/575/head -> origin/gh/jansel/575/head 2025-12-04T08:56:59.2629107Z * [new branch] gh/jansel/575/orig -> origin/gh/jansel/575/orig 2025-12-04T08:56:59.2631278Z * [new branch] gh/jansel/576/base -> origin/gh/jansel/576/base 2025-12-04T08:56:59.2632840Z * [new branch] gh/jansel/576/head -> origin/gh/jansel/576/head 2025-12-04T08:56:59.2634928Z * [new branch] gh/jansel/576/orig -> origin/gh/jansel/576/orig 2025-12-04T08:56:59.2637476Z * [new branch] gh/jbschlosser/247/base -> origin/gh/jbschlosser/247/base 2025-12-04T08:56:59.2639021Z * [new branch] gh/jbschlosser/247/head -> origin/gh/jbschlosser/247/head 2025-12-04T08:56:59.2640537Z * [new branch] gh/jbschlosser/247/orig -> origin/gh/jbschlosser/247/orig 2025-12-04T08:56:59.2642704Z * [new branch] gh/jbschlosser/250/base -> origin/gh/jbschlosser/250/base 2025-12-04T08:56:59.2644172Z * [new branch] gh/jbschlosser/250/head -> origin/gh/jbschlosser/250/head 2025-12-04T08:56:59.2645729Z * [new branch] gh/jbschlosser/250/orig -> origin/gh/jbschlosser/250/orig 2025-12-04T08:56:59.2648300Z * [new branch] gh/jerryzh168/1/base -> origin/gh/jerryzh168/1/base 2025-12-04T08:56:59.2649944Z * [new branch] gh/jerryzh168/1/head -> origin/gh/jerryzh168/1/head 2025-12-04T08:56:59.2651284Z * [new branch] gh/jerryzh168/1/orig -> origin/gh/jerryzh168/1/orig 2025-12-04T08:56:59.2653975Z * [new branch] gh/jiayisunx/59/base -> origin/gh/jiayisunx/59/base 2025-12-04T08:56:59.2655525Z * [new branch] gh/jiayisunx/59/head -> origin/gh/jiayisunx/59/head 2025-12-04T08:56:59.2657085Z * [new branch] gh/jiayisunx/59/orig -> origin/gh/jiayisunx/59/orig 2025-12-04T08:56:59.2659085Z * [new branch] gh/jiayisunx/61/base -> origin/gh/jiayisunx/61/base 2025-12-04T08:56:59.2660637Z * [new branch] gh/jiayisunx/61/head -> origin/gh/jiayisunx/61/head 2025-12-04T08:56:59.2662167Z * [new branch] gh/jiayisunx/61/orig -> origin/gh/jiayisunx/61/orig 2025-12-04T08:56:59.2664512Z * [new branch] gh/jiayisunx/68/base -> origin/gh/jiayisunx/68/base 2025-12-04T08:56:59.2666072Z * [new branch] gh/jiayisunx/68/head -> origin/gh/jiayisunx/68/head 2025-12-04T08:56:59.2667810Z * [new branch] gh/jiayisunx/68/orig -> origin/gh/jiayisunx/68/orig 2025-12-04T08:56:59.2669972Z * [new branch] gh/jiayisunx/77/base -> origin/gh/jiayisunx/77/base 2025-12-04T08:56:59.2671304Z * [new branch] gh/jiayisunx/77/head -> origin/gh/jiayisunx/77/head 2025-12-04T08:56:59.2673056Z * [new branch] gh/jiayisunx/77/orig -> origin/gh/jiayisunx/77/orig 2025-12-04T08:56:59.2675492Z * [new branch] gh/jiayisunx/78/base -> origin/gh/jiayisunx/78/base 2025-12-04T08:56:59.2677000Z * [new branch] gh/jiayisunx/78/head -> origin/gh/jiayisunx/78/head 2025-12-04T08:56:59.2678728Z * [new branch] gh/jiayisunx/78/orig -> origin/gh/jiayisunx/78/orig 2025-12-04T08:56:59.2680634Z * [new branch] gh/jiayisunx/79/base -> origin/gh/jiayisunx/79/base 2025-12-04T08:56:59.2682269Z * [new branch] gh/jiayisunx/79/head -> origin/gh/jiayisunx/79/head 2025-12-04T08:56:59.2683731Z * [new branch] gh/jiayisunx/79/orig -> origin/gh/jiayisunx/79/orig 2025-12-04T08:56:59.2686452Z * [new branch] gh/jiayisunx/82/base -> origin/gh/jiayisunx/82/base 2025-12-04T08:56:59.2687910Z * [new branch] gh/jiayisunx/82/head -> origin/gh/jiayisunx/82/head 2025-12-04T08:56:59.2689601Z * [new branch] gh/jiayisunx/82/orig -> origin/gh/jiayisunx/82/orig 2025-12-04T08:56:59.2691585Z * [new branch] gh/jiayisunx/83/base -> origin/gh/jiayisunx/83/base 2025-12-04T08:56:59.2693346Z * [new branch] gh/jiayisunx/83/head -> origin/gh/jiayisunx/83/head 2025-12-04T08:56:59.2694820Z * [new branch] gh/jiayisunx/83/orig -> origin/gh/jiayisunx/83/orig 2025-12-04T08:56:59.2697014Z * [new branch] gh/jiayisunx/84/base -> origin/gh/jiayisunx/84/base 2025-12-04T08:56:59.2698458Z * [new branch] gh/jiayisunx/84/head -> origin/gh/jiayisunx/84/head 2025-12-04T08:56:59.2700040Z * [new branch] gh/jiayisunx/84/orig -> origin/gh/jiayisunx/84/orig 2025-12-04T08:56:59.2702046Z * [new branch] gh/jiayisunx/85/base -> origin/gh/jiayisunx/85/base 2025-12-04T08:56:59.2703684Z * [new branch] gh/jiayisunx/85/head -> origin/gh/jiayisunx/85/head 2025-12-04T08:56:59.2704985Z * [new branch] gh/jiayisunx/85/orig -> origin/gh/jiayisunx/85/orig 2025-12-04T08:56:59.2707106Z * [new branch] gh/jiayisunx/86/base -> origin/gh/jiayisunx/86/base 2025-12-04T08:56:59.2708741Z * [new branch] gh/jiayisunx/86/head -> origin/gh/jiayisunx/86/head 2025-12-04T08:56:59.2710341Z * [new branch] gh/jiayisunx/86/orig -> origin/gh/jiayisunx/86/orig 2025-12-04T08:56:59.2712579Z * [new branch] gh/jiayisunx/87/base -> origin/gh/jiayisunx/87/base 2025-12-04T08:56:59.2713870Z * [new branch] gh/jiayisunx/87/head -> origin/gh/jiayisunx/87/head 2025-12-04T08:56:59.2715490Z * [new branch] gh/jiayisunx/87/orig -> origin/gh/jiayisunx/87/orig 2025-12-04T08:56:59.2717646Z * [new branch] gh/jiayisunx/88/base -> origin/gh/jiayisunx/88/base 2025-12-04T08:56:59.2719219Z * [new branch] gh/jiayisunx/88/head -> origin/gh/jiayisunx/88/head 2025-12-04T08:56:59.2720770Z * [new branch] gh/jiayisunx/88/orig -> origin/gh/jiayisunx/88/orig 2025-12-04T08:56:59.2722845Z * [new branch] gh/jiayisunx/89/base -> origin/gh/jiayisunx/89/base 2025-12-04T08:56:59.2724493Z * [new branch] gh/jiayisunx/89/head -> origin/gh/jiayisunx/89/head 2025-12-04T08:56:59.2725929Z * [new branch] gh/jiayisunx/89/orig -> origin/gh/jiayisunx/89/orig 2025-12-04T08:56:59.2728100Z * [new branch] gh/jiayisunx/90/base -> origin/gh/jiayisunx/90/base 2025-12-04T08:56:59.2729687Z * [new branch] gh/jiayisunx/90/head -> origin/gh/jiayisunx/90/head 2025-12-04T08:56:59.2731244Z * [new branch] gh/jiayisunx/90/orig -> origin/gh/jiayisunx/90/orig 2025-12-04T08:56:59.2733793Z * [new branch] gh/jjwu@meta.com/1/base -> origin/gh/jjwu@meta.com/1/base 2025-12-04T08:56:59.2735385Z * [new branch] gh/jjwu@meta.com/1/head -> origin/gh/jjwu@meta.com/1/head 2025-12-04T08:56:59.2738322Z * [new branch] gh/jturney/1/base -> origin/gh/jturney/1/base 2025-12-04T08:56:59.2739879Z * [new branch] gh/jturney/1/head -> origin/gh/jturney/1/head 2025-12-04T08:56:59.2741212Z * [new branch] gh/jturney/1/orig -> origin/gh/jturney/1/orig 2025-12-04T08:56:59.2743378Z * [new branch] gh/jturney/2/base -> origin/gh/jturney/2/base 2025-12-04T08:56:59.2745033Z * [new branch] gh/jturney/2/head -> origin/gh/jturney/2/head 2025-12-04T08:56:59.2746436Z * [new branch] gh/jturney/2/orig -> origin/gh/jturney/2/orig 2025-12-04T08:56:59.2749096Z * [new branch] gh/karthickai/10/base -> origin/gh/karthickai/10/base 2025-12-04T08:56:59.2750788Z * [new branch] gh/karthickai/10/head -> origin/gh/karthickai/10/head 2025-12-04T08:56:59.2752255Z * [new branch] gh/karthickai/10/orig -> origin/gh/karthickai/10/orig 2025-12-04T08:56:59.2754408Z * [new branch] gh/karthickai/11/base -> origin/gh/karthickai/11/base 2025-12-04T08:56:59.2756057Z * [new branch] gh/karthickai/11/head -> origin/gh/karthickai/11/head 2025-12-04T08:56:59.2757616Z * [new branch] gh/karthickai/11/orig -> origin/gh/karthickai/11/orig 2025-12-04T08:56:59.2760019Z * [new branch] gh/karthickai/12/base -> origin/gh/karthickai/12/base 2025-12-04T08:56:59.2761659Z * [new branch] gh/karthickai/12/head -> origin/gh/karthickai/12/head 2025-12-04T08:56:59.2765165Z * [new branch] gh/karthickai/12/orig -> origin/gh/karthickai/12/orig 2025-12-04T08:56:59.2767401Z * [new branch] gh/karthickai/13/base -> origin/gh/karthickai/13/base 2025-12-04T08:56:59.2768877Z * [new branch] gh/karthickai/13/head -> origin/gh/karthickai/13/head 2025-12-04T08:56:59.2770437Z * [new branch] gh/karthickai/13/orig -> origin/gh/karthickai/13/orig 2025-12-04T08:56:59.2772719Z * [new branch] gh/karthickai/14/base -> origin/gh/karthickai/14/base 2025-12-04T08:56:59.2774408Z * [new branch] gh/karthickai/14/head -> origin/gh/karthickai/14/head 2025-12-04T08:56:59.2776019Z * [new branch] gh/karthickai/14/orig -> origin/gh/karthickai/14/orig 2025-12-04T08:56:59.2778433Z * [new branch] gh/karthickai/15/base -> origin/gh/karthickai/15/base 2025-12-04T08:56:59.2780466Z * [new branch] gh/karthickai/15/head -> origin/gh/karthickai/15/head 2025-12-04T08:56:59.2782029Z * [new branch] gh/karthickai/15/orig -> origin/gh/karthickai/15/orig 2025-12-04T08:56:59.2784051Z * [new branch] gh/karthickai/16/base -> origin/gh/karthickai/16/base 2025-12-04T08:56:59.2785630Z * [new branch] gh/karthickai/16/head -> origin/gh/karthickai/16/head 2025-12-04T08:56:59.2787226Z * [new branch] gh/karthickai/16/orig -> origin/gh/karthickai/16/orig 2025-12-04T08:56:59.2789205Z * [new branch] gh/karthickai/17/base -> origin/gh/karthickai/17/base 2025-12-04T08:56:59.2790824Z * [new branch] gh/karthickai/17/head -> origin/gh/karthickai/17/head 2025-12-04T08:56:59.2792296Z * [new branch] gh/karthickai/17/orig -> origin/gh/karthickai/17/orig 2025-12-04T08:56:59.2794912Z * [new branch] gh/karthickai/18/base -> origin/gh/karthickai/18/base 2025-12-04T08:56:59.2796662Z * [new branch] gh/karthickai/18/head -> origin/gh/karthickai/18/head 2025-12-04T08:56:59.2798328Z * [new branch] gh/karthickai/18/orig -> origin/gh/karthickai/18/orig 2025-12-04T08:56:59.2800620Z * [new branch] gh/karthickai/19/base -> origin/gh/karthickai/19/base 2025-12-04T08:56:59.2802250Z * [new branch] gh/karthickai/19/head -> origin/gh/karthickai/19/head 2025-12-04T08:56:59.2803793Z * [new branch] gh/karthickai/19/orig -> origin/gh/karthickai/19/orig 2025-12-04T08:56:59.2806491Z * [new branch] gh/karthickai/20/base -> origin/gh/karthickai/20/base 2025-12-04T08:56:59.2808474Z * [new branch] gh/karthickai/20/head -> origin/gh/karthickai/20/head 2025-12-04T08:56:59.2810077Z * [new branch] gh/karthickai/20/orig -> origin/gh/karthickai/20/orig 2025-12-04T08:56:59.2812161Z * [new branch] gh/karthickai/21/base -> origin/gh/karthickai/21/base 2025-12-04T08:56:59.2814084Z * [new branch] gh/karthickai/21/head -> origin/gh/karthickai/21/head 2025-12-04T08:56:59.2815560Z * [new branch] gh/karthickai/21/orig -> origin/gh/karthickai/21/orig 2025-12-04T08:56:59.2817707Z * [new branch] gh/karthickai/22/base -> origin/gh/karthickai/22/base 2025-12-04T08:56:59.2819190Z * [new branch] gh/karthickai/22/head -> origin/gh/karthickai/22/head 2025-12-04T08:56:59.2820810Z * [new branch] gh/karthickai/22/orig -> origin/gh/karthickai/22/orig 2025-12-04T08:56:59.2822989Z * [new branch] gh/karthickai/23/base -> origin/gh/karthickai/23/base 2025-12-04T08:56:59.2824768Z * [new branch] gh/karthickai/23/head -> origin/gh/karthickai/23/head 2025-12-04T08:56:59.2826349Z * [new branch] gh/karthickai/23/orig -> origin/gh/karthickai/23/orig 2025-12-04T08:56:59.2828392Z * [new branch] gh/karthickai/24/base -> origin/gh/karthickai/24/base 2025-12-04T08:56:59.2829929Z * [new branch] gh/karthickai/24/head -> origin/gh/karthickai/24/head 2025-12-04T08:56:59.2831624Z * [new branch] gh/karthickai/24/orig -> origin/gh/karthickai/24/orig 2025-12-04T08:56:59.2834147Z * [new branch] gh/karthickai/25/base -> origin/gh/karthickai/25/base 2025-12-04T08:56:59.2835781Z * [new branch] gh/karthickai/25/head -> origin/gh/karthickai/25/head 2025-12-04T08:56:59.2837318Z * [new branch] gh/karthickai/25/orig -> origin/gh/karthickai/25/orig 2025-12-04T08:56:59.2839262Z * [new branch] gh/karthickai/26/base -> origin/gh/karthickai/26/base 2025-12-04T08:56:59.2840964Z * [new branch] gh/karthickai/26/head -> origin/gh/karthickai/26/head 2025-12-04T08:56:59.2842536Z * [new branch] gh/karthickai/26/orig -> origin/gh/karthickai/26/orig 2025-12-04T08:56:59.2845713Z * [new branch] gh/karthickai/6/base -> origin/gh/karthickai/6/base 2025-12-04T08:56:59.2847736Z * [new branch] gh/karthickai/6/head -> origin/gh/karthickai/6/head 2025-12-04T08:56:59.2849349Z * [new branch] gh/karthickai/6/orig -> origin/gh/karthickai/6/orig 2025-12-04T08:56:59.2851828Z * [new branch] gh/krocki/1/base -> origin/gh/krocki/1/base 2025-12-04T08:56:59.2853536Z * [new branch] gh/krocki/1/head -> origin/gh/krocki/1/head 2025-12-04T08:56:59.2855084Z * [new branch] gh/krocki/1/orig -> origin/gh/krocki/1/orig 2025-12-04T08:56:59.2857184Z * [new branch] gh/krocki/2/base -> origin/gh/krocki/2/base 2025-12-04T08:56:59.2858714Z * [new branch] gh/krocki/2/head -> origin/gh/krocki/2/head 2025-12-04T08:56:59.2860147Z * [new branch] gh/krocki/2/orig -> origin/gh/krocki/2/orig 2025-12-04T08:56:59.2862907Z * [new branch] gh/kurtamohler/60/base -> origin/gh/kurtamohler/60/base 2025-12-04T08:56:59.2864725Z * [new branch] gh/kurtamohler/60/head -> origin/gh/kurtamohler/60/head 2025-12-04T08:56:59.2866360Z * [new branch] gh/kurtamohler/60/orig -> origin/gh/kurtamohler/60/orig 2025-12-04T08:56:59.2869006Z * [new branch] gh/kurtamohler/61/base -> origin/gh/kurtamohler/61/base 2025-12-04T08:56:59.2870672Z * [new branch] gh/kurtamohler/61/head -> origin/gh/kurtamohler/61/head 2025-12-04T08:56:59.2872233Z * [new branch] gh/kurtamohler/61/orig -> origin/gh/kurtamohler/61/orig 2025-12-04T08:56:59.2874333Z * [new branch] gh/kurtamohler/62/base -> origin/gh/kurtamohler/62/base 2025-12-04T08:56:59.2875798Z * [new branch] gh/kurtamohler/62/head -> origin/gh/kurtamohler/62/head 2025-12-04T08:56:59.2877403Z * [new branch] gh/kurtamohler/62/orig -> origin/gh/kurtamohler/62/orig 2025-12-04T08:56:59.2880064Z * [new branch] gh/kurtamohler/63/base -> origin/gh/kurtamohler/63/base 2025-12-04T08:56:59.2881670Z * [new branch] gh/kurtamohler/63/head -> origin/gh/kurtamohler/63/head 2025-12-04T08:56:59.2883173Z * [new branch] gh/kurtamohler/63/orig -> origin/gh/kurtamohler/63/orig 2025-12-04T08:56:59.2885155Z * [new branch] gh/kurtamohler/64/base -> origin/gh/kurtamohler/64/base 2025-12-04T08:56:59.2886707Z * [new branch] gh/kurtamohler/64/head -> origin/gh/kurtamohler/64/head 2025-12-04T08:56:59.2888258Z * [new branch] gh/kurtamohler/64/orig -> origin/gh/kurtamohler/64/orig 2025-12-04T08:56:59.2890413Z * [new branch] gh/kurtamohler/65/base -> origin/gh/kurtamohler/65/base 2025-12-04T08:56:59.2892010Z * [new branch] gh/kurtamohler/65/head -> origin/gh/kurtamohler/65/head 2025-12-04T08:56:59.2893669Z * [new branch] gh/kurtamohler/65/orig -> origin/gh/kurtamohler/65/orig 2025-12-04T08:56:59.2895700Z * [new branch] gh/kurtamohler/66/base -> origin/gh/kurtamohler/66/base 2025-12-04T08:56:59.2897248Z * [new branch] gh/kurtamohler/66/head -> origin/gh/kurtamohler/66/head 2025-12-04T08:56:59.2898817Z * [new branch] gh/kurtamohler/66/orig -> origin/gh/kurtamohler/66/orig 2025-12-04T08:56:59.2900941Z * [new branch] gh/kurtamohler/67/base -> origin/gh/kurtamohler/67/base 2025-12-04T08:56:59.2902458Z * [new branch] gh/kurtamohler/67/head -> origin/gh/kurtamohler/67/head 2025-12-04T08:56:59.2904003Z * [new branch] gh/kurtamohler/67/orig -> origin/gh/kurtamohler/67/orig 2025-12-04T08:56:59.2906618Z * [new branch] gh/kwen2501/130/base -> origin/gh/kwen2501/130/base 2025-12-04T08:56:59.2908437Z * [new branch] gh/kwen2501/130/head -> origin/gh/kwen2501/130/head 2025-12-04T08:56:59.2910003Z * [new branch] gh/kwen2501/130/orig -> origin/gh/kwen2501/130/orig 2025-12-04T08:56:59.2912088Z * [new branch] gh/kwen2501/170/base -> origin/gh/kwen2501/170/base 2025-12-04T08:56:59.2913669Z * [new branch] gh/kwen2501/170/head -> origin/gh/kwen2501/170/head 2025-12-04T08:56:59.2915846Z * [new branch] gh/kwen2501/187/base -> origin/gh/kwen2501/187/base 2025-12-04T08:56:59.2917374Z * [new branch] gh/kwen2501/187/head -> origin/gh/kwen2501/187/head 2025-12-04T08:56:59.2919116Z * [new branch] gh/kwen2501/187/orig -> origin/gh/kwen2501/187/orig 2025-12-04T08:56:59.2921155Z * [new branch] gh/kwen2501/188/base -> origin/gh/kwen2501/188/base 2025-12-04T08:56:59.2922713Z * [new branch] gh/kwen2501/188/head -> origin/gh/kwen2501/188/head 2025-12-04T08:56:59.2924252Z * [new branch] gh/kwen2501/188/orig -> origin/gh/kwen2501/188/orig 2025-12-04T08:56:59.2926260Z * [new branch] gh/kwen2501/211/base -> origin/gh/kwen2501/211/base 2025-12-04T08:56:59.2927824Z * [new branch] gh/kwen2501/211/head -> origin/gh/kwen2501/211/head 2025-12-04T08:56:59.2929994Z * [new branch] gh/kwen2501/224/base -> origin/gh/kwen2501/224/base 2025-12-04T08:56:59.2931511Z * [new branch] gh/kwen2501/224/head -> origin/gh/kwen2501/224/head 2025-12-04T08:56:59.2933115Z * [new branch] gh/kwen2501/224/orig -> origin/gh/kwen2501/224/orig 2025-12-04T08:56:59.2935195Z * [new branch] gh/kwen2501/228/base -> origin/gh/kwen2501/228/base 2025-12-04T08:56:59.2936738Z * [new branch] gh/kwen2501/228/head -> origin/gh/kwen2501/228/head 2025-12-04T08:56:59.2938311Z * [new branch] gh/kwen2501/228/orig -> origin/gh/kwen2501/228/orig 2025-12-04T08:56:59.2940532Z * [new branch] gh/kwen2501/234/base -> origin/gh/kwen2501/234/base 2025-12-04T08:56:59.2942054Z * [new branch] gh/kwen2501/234/head -> origin/gh/kwen2501/234/head 2025-12-04T08:56:59.2943583Z * [new branch] gh/kwen2501/234/orig -> origin/gh/kwen2501/234/orig 2025-12-04T08:56:59.2945631Z * [new branch] gh/kwen2501/235/base -> origin/gh/kwen2501/235/base 2025-12-04T08:56:59.2947190Z * [new branch] gh/kwen2501/235/head -> origin/gh/kwen2501/235/head 2025-12-04T08:56:59.2948768Z * [new branch] gh/kwen2501/235/orig -> origin/gh/kwen2501/235/orig 2025-12-04T08:56:59.2950890Z * [new branch] gh/kwen2501/236/base -> origin/gh/kwen2501/236/base 2025-12-04T08:56:59.2952441Z * [new branch] gh/kwen2501/236/head -> origin/gh/kwen2501/236/head 2025-12-04T08:56:59.2954002Z * [new branch] gh/kwen2501/236/orig -> origin/gh/kwen2501/236/orig 2025-12-04T08:56:59.2956032Z * [new branch] gh/kwen2501/237/base -> origin/gh/kwen2501/237/base 2025-12-04T08:56:59.2957587Z * [new branch] gh/kwen2501/237/head -> origin/gh/kwen2501/237/head 2025-12-04T08:56:59.2959571Z * [new branch] gh/kwen2501/237/orig -> origin/gh/kwen2501/237/orig 2025-12-04T08:56:59.2961655Z * [new branch] gh/kwen2501/238/base -> origin/gh/kwen2501/238/base 2025-12-04T08:56:59.2963445Z * [new branch] gh/kwen2501/238/head -> origin/gh/kwen2501/238/head 2025-12-04T08:56:59.2964925Z * [new branch] gh/kwen2501/238/orig -> origin/gh/kwen2501/238/orig 2025-12-04T08:56:59.2967076Z * [new branch] gh/kwen2501/240/base -> origin/gh/kwen2501/240/base 2025-12-04T08:56:59.2968615Z * [new branch] gh/kwen2501/240/head -> origin/gh/kwen2501/240/head 2025-12-04T08:56:59.2970350Z * [new branch] gh/kwen2501/240/orig -> origin/gh/kwen2501/240/orig 2025-12-04T08:56:59.2972450Z * [new branch] gh/kwen2501/241/base -> origin/gh/kwen2501/241/base 2025-12-04T08:56:59.2974091Z * [new branch] gh/kwen2501/241/head -> origin/gh/kwen2501/241/head 2025-12-04T08:56:59.2975605Z * [new branch] gh/kwen2501/241/orig -> origin/gh/kwen2501/241/orig 2025-12-04T08:56:59.2977887Z * [new branch] gh/kwen2501/247/base -> origin/gh/kwen2501/247/base 2025-12-04T08:56:59.2979286Z * [new branch] gh/kwen2501/247/head -> origin/gh/kwen2501/247/head 2025-12-04T08:56:59.2980778Z * [new branch] gh/kwen2501/247/orig -> origin/gh/kwen2501/247/orig 2025-12-04T08:56:59.2982855Z * [new branch] gh/kwen2501/252/base -> origin/gh/kwen2501/252/base 2025-12-04T08:56:59.2984450Z * [new branch] gh/kwen2501/252/head -> origin/gh/kwen2501/252/head 2025-12-04T08:56:59.2985961Z * [new branch] gh/kwen2501/252/orig -> origin/gh/kwen2501/252/orig 2025-12-04T08:56:59.2988517Z * [new branch] gh/kwen2501/259/base -> origin/gh/kwen2501/259/base 2025-12-04T08:56:59.2990172Z * [new branch] gh/kwen2501/259/head -> origin/gh/kwen2501/259/head 2025-12-04T08:56:59.2991775Z * [new branch] gh/kwen2501/259/orig -> origin/gh/kwen2501/259/orig 2025-12-04T08:56:59.2994092Z * [new branch] gh/kwen2501/260/base -> origin/gh/kwen2501/260/base 2025-12-04T08:56:59.2995713Z * [new branch] gh/kwen2501/260/head -> origin/gh/kwen2501/260/head 2025-12-04T08:56:59.2997255Z * [new branch] gh/kwen2501/260/orig -> origin/gh/kwen2501/260/orig 2025-12-04T08:56:59.2999407Z * [new branch] gh/kwen2501/268/base -> origin/gh/kwen2501/268/base 2025-12-04T08:56:59.3000941Z * [new branch] gh/kwen2501/268/head -> origin/gh/kwen2501/268/head 2025-12-04T08:56:59.3002480Z * [new branch] gh/kwen2501/268/orig -> origin/gh/kwen2501/268/orig 2025-12-04T08:56:59.3004612Z * [new branch] gh/kwen2501/269/base -> origin/gh/kwen2501/269/base 2025-12-04T08:56:59.3006286Z * [new branch] gh/kwen2501/269/head -> origin/gh/kwen2501/269/head 2025-12-04T08:56:59.3007815Z * [new branch] gh/kwen2501/269/orig -> origin/gh/kwen2501/269/orig 2025-12-04T08:56:59.3009958Z * [new branch] gh/kwen2501/270/base -> origin/gh/kwen2501/270/base 2025-12-04T08:56:59.3011584Z * [new branch] gh/kwen2501/270/head -> origin/gh/kwen2501/270/head 2025-12-04T08:56:59.3013319Z * [new branch] gh/kwen2501/270/orig -> origin/gh/kwen2501/270/orig 2025-12-04T08:56:59.3015588Z * [new branch] gh/kwen2501/271/base -> origin/gh/kwen2501/271/base 2025-12-04T08:56:59.3017149Z * [new branch] gh/kwen2501/271/head -> origin/gh/kwen2501/271/head 2025-12-04T08:56:59.3019189Z * [new branch] gh/kwen2501/271/orig -> origin/gh/kwen2501/271/orig 2025-12-04T08:56:59.3021390Z * [new branch] gh/kwen2501/274/base -> origin/gh/kwen2501/274/base 2025-12-04T08:56:59.3023159Z * [new branch] gh/kwen2501/274/head -> origin/gh/kwen2501/274/head 2025-12-04T08:56:59.3024702Z * [new branch] gh/kwen2501/274/orig -> origin/gh/kwen2501/274/orig 2025-12-04T08:56:59.3026909Z * [new branch] gh/kwen2501/275/base -> origin/gh/kwen2501/275/base 2025-12-04T08:56:59.3028604Z * [new branch] gh/kwen2501/275/head -> origin/gh/kwen2501/275/head 2025-12-04T08:56:59.3030145Z * [new branch] gh/kwen2501/275/orig -> origin/gh/kwen2501/275/orig 2025-12-04T08:56:59.3032271Z * [new branch] gh/kwen2501/276/base -> origin/gh/kwen2501/276/base 2025-12-04T08:56:59.3034006Z * [new branch] gh/kwen2501/276/head -> origin/gh/kwen2501/276/head 2025-12-04T08:56:59.3035537Z * [new branch] gh/kwen2501/276/orig -> origin/gh/kwen2501/276/orig 2025-12-04T08:56:59.3038101Z * [new branch] gh/kwen2501/277/base -> origin/gh/kwen2501/277/base 2025-12-04T08:56:59.3039647Z * [new branch] gh/kwen2501/277/head -> origin/gh/kwen2501/277/head 2025-12-04T08:56:59.3041277Z * [new branch] gh/kwen2501/277/orig -> origin/gh/kwen2501/277/orig 2025-12-04T08:56:59.3043401Z * [new branch] gh/kwen2501/278/base -> origin/gh/kwen2501/278/base 2025-12-04T08:56:59.3044947Z * [new branch] gh/kwen2501/278/head -> origin/gh/kwen2501/278/head 2025-12-04T08:56:59.3046492Z * [new branch] gh/kwen2501/278/orig -> origin/gh/kwen2501/278/orig 2025-12-04T08:56:59.3048781Z * [new branch] gh/kwen2501/279/base -> origin/gh/kwen2501/279/base 2025-12-04T08:56:59.3050466Z * [new branch] gh/kwen2501/279/head -> origin/gh/kwen2501/279/head 2025-12-04T08:56:59.3052077Z * [new branch] gh/kwen2501/279/orig -> origin/gh/kwen2501/279/orig 2025-12-04T08:56:59.3054404Z * [new branch] gh/kwen2501/280/base -> origin/gh/kwen2501/280/base 2025-12-04T08:56:59.3056000Z * [new branch] gh/kwen2501/280/head -> origin/gh/kwen2501/280/head 2025-12-04T08:56:59.3057541Z * [new branch] gh/kwen2501/280/orig -> origin/gh/kwen2501/280/orig 2025-12-04T08:56:59.3059758Z * [new branch] gh/kwen2501/281/base -> origin/gh/kwen2501/281/base 2025-12-04T08:56:59.3061333Z * [new branch] gh/kwen2501/281/head -> origin/gh/kwen2501/281/head 2025-12-04T08:56:59.3065057Z * [new branch] gh/kwen2501/281/orig -> origin/gh/kwen2501/281/orig 2025-12-04T08:56:59.3067228Z * [new branch] gh/kwen2501/282/base -> origin/gh/kwen2501/282/base 2025-12-04T08:56:59.3068798Z * [new branch] gh/kwen2501/282/head -> origin/gh/kwen2501/282/head 2025-12-04T08:56:59.3070355Z * [new branch] gh/kwen2501/282/orig -> origin/gh/kwen2501/282/orig 2025-12-04T08:56:59.3072523Z * [new branch] gh/kwen2501/283/base -> origin/gh/kwen2501/283/base 2025-12-04T08:56:59.3074085Z * [new branch] gh/kwen2501/283/head -> origin/gh/kwen2501/283/head 2025-12-04T08:56:59.3075607Z * [new branch] gh/kwen2501/283/orig -> origin/gh/kwen2501/283/orig 2025-12-04T08:56:59.3077831Z * [new branch] gh/kwen2501/284/base -> origin/gh/kwen2501/284/base 2025-12-04T08:56:59.3079467Z * [new branch] gh/kwen2501/284/head -> origin/gh/kwen2501/284/head 2025-12-04T08:56:59.3081046Z * [new branch] gh/kwen2501/284/orig -> origin/gh/kwen2501/284/orig 2025-12-04T08:56:59.3083261Z * [new branch] gh/kwen2501/285/base -> origin/gh/kwen2501/285/base 2025-12-04T08:56:59.3084781Z * [new branch] gh/kwen2501/285/head -> origin/gh/kwen2501/285/head 2025-12-04T08:56:59.3086333Z * [new branch] gh/kwen2501/285/orig -> origin/gh/kwen2501/285/orig 2025-12-04T08:56:59.3088464Z * [new branch] gh/kwen2501/286/base -> origin/gh/kwen2501/286/base 2025-12-04T08:56:59.3090489Z * [new branch] gh/kwen2501/286/head -> origin/gh/kwen2501/286/head 2025-12-04T08:56:59.3092067Z * [new branch] gh/kwen2501/286/orig -> origin/gh/kwen2501/286/orig 2025-12-04T08:56:59.3094290Z * [new branch] gh/kwen2501/287/base -> origin/gh/kwen2501/287/base 2025-12-04T08:56:59.3095821Z * [new branch] gh/kwen2501/287/head -> origin/gh/kwen2501/287/head 2025-12-04T08:56:59.3097346Z * [new branch] gh/kwen2501/287/orig -> origin/gh/kwen2501/287/orig 2025-12-04T08:56:59.3099733Z * [new branch] gh/kwen2501/288/base -> origin/gh/kwen2501/288/base 2025-12-04T08:56:59.3101195Z * [new branch] gh/kwen2501/288/head -> origin/gh/kwen2501/288/head 2025-12-04T08:56:59.3102836Z * [new branch] gh/kwen2501/288/orig -> origin/gh/kwen2501/288/orig 2025-12-04T08:56:59.3105451Z * [new branch] gh/laithsakka/251/base -> origin/gh/laithsakka/251/base 2025-12-04T08:56:59.3107546Z * [new branch] gh/laithsakka/251/head -> origin/gh/laithsakka/251/head 2025-12-04T08:56:59.3109129Z * [new branch] gh/laithsakka/251/orig -> origin/gh/laithsakka/251/orig 2025-12-04T08:56:59.3111263Z * [new branch] gh/laithsakka/276/base -> origin/gh/laithsakka/276/base 2025-12-04T08:56:59.3112984Z * [new branch] gh/laithsakka/276/head -> origin/gh/laithsakka/276/head 2025-12-04T08:56:59.3114347Z * [new branch] gh/laithsakka/276/orig -> origin/gh/laithsakka/276/orig 2025-12-04T08:56:59.3116637Z * [new branch] gh/laithsakka/28/base -> origin/gh/laithsakka/28/base 2025-12-04T08:56:59.3118613Z * [new branch] gh/laithsakka/29/base -> origin/gh/laithsakka/29/base 2025-12-04T08:56:59.3120499Z * [new branch] gh/laithsakka/30/base -> origin/gh/laithsakka/30/base 2025-12-04T08:56:59.3122022Z * [new branch] gh/laithsakka/30/head -> origin/gh/laithsakka/30/head 2025-12-04T08:56:59.3135252Z * [new branch] gh/laithsakka/31/base -> origin/gh/laithsakka/31/base 2025-12-04T08:56:59.3135826Z * [new branch] gh/laithsakka/31/head -> origin/gh/laithsakka/31/head 2025-12-04T08:56:59.3136393Z * [new branch] gh/laithsakka/313/base -> origin/gh/laithsakka/313/base 2025-12-04T08:56:59.3136794Z * [new branch] gh/laithsakka/313/head -> origin/gh/laithsakka/313/head 2025-12-04T08:56:59.3137196Z * [new branch] gh/laithsakka/313/orig -> origin/gh/laithsakka/313/orig 2025-12-04T08:56:59.3137581Z * [new branch] gh/laithsakka/316/base -> origin/gh/laithsakka/316/base 2025-12-04T08:56:59.3138176Z * [new branch] gh/laithsakka/316/head -> origin/gh/laithsakka/316/head 2025-12-04T08:56:59.3138615Z * [new branch] gh/laithsakka/316/orig -> origin/gh/laithsakka/316/orig 2025-12-04T08:56:59.3138998Z * [new branch] gh/laithsakka/317/base -> origin/gh/laithsakka/317/base 2025-12-04T08:56:59.3139900Z * [new branch] gh/laithsakka/317/head -> origin/gh/laithsakka/317/head 2025-12-04T08:56:59.3141588Z * [new branch] gh/laithsakka/317/orig -> origin/gh/laithsakka/317/orig 2025-12-04T08:56:59.3143676Z * [new branch] gh/laithsakka/319/base -> origin/gh/laithsakka/319/base 2025-12-04T08:56:59.3145354Z * [new branch] gh/laithsakka/319/head -> origin/gh/laithsakka/319/head 2025-12-04T08:56:59.3146821Z * [new branch] gh/laithsakka/319/orig -> origin/gh/laithsakka/319/orig 2025-12-04T08:56:59.3148783Z * [new branch] gh/laithsakka/32/base -> origin/gh/laithsakka/32/base 2025-12-04T08:56:59.3150393Z * [new branch] gh/laithsakka/32/head -> origin/gh/laithsakka/32/head 2025-12-04T08:56:59.3152359Z * [new branch] gh/laithsakka/320/base -> origin/gh/laithsakka/320/base 2025-12-04T08:56:59.3153825Z * [new branch] gh/laithsakka/320/head -> origin/gh/laithsakka/320/head 2025-12-04T08:56:59.3155432Z * [new branch] gh/laithsakka/320/orig -> origin/gh/laithsakka/320/orig 2025-12-04T08:56:59.3157439Z * [new branch] gh/laithsakka/321/base -> origin/gh/laithsakka/321/base 2025-12-04T08:56:59.3158896Z * [new branch] gh/laithsakka/321/head -> origin/gh/laithsakka/321/head 2025-12-04T08:56:59.3160669Z * [new branch] gh/laithsakka/321/orig -> origin/gh/laithsakka/321/orig 2025-12-04T08:56:59.3162959Z * [new branch] gh/laithsakka/322/base -> origin/gh/laithsakka/322/base 2025-12-04T08:56:59.3164547Z * [new branch] gh/laithsakka/322/head -> origin/gh/laithsakka/322/head 2025-12-04T08:56:59.3166160Z * [new branch] gh/laithsakka/322/orig -> origin/gh/laithsakka/322/orig 2025-12-04T08:56:59.3168354Z * [new branch] gh/laithsakka/323/base -> origin/gh/laithsakka/323/base 2025-12-04T08:56:59.3170086Z * [new branch] gh/laithsakka/323/head -> origin/gh/laithsakka/323/head 2025-12-04T08:56:59.3171646Z * [new branch] gh/laithsakka/323/orig -> origin/gh/laithsakka/323/orig 2025-12-04T08:56:59.3173836Z * [new branch] gh/laithsakka/324/base -> origin/gh/laithsakka/324/base 2025-12-04T08:56:59.3175381Z * [new branch] gh/laithsakka/324/head -> origin/gh/laithsakka/324/head 2025-12-04T08:56:59.3176979Z * [new branch] gh/laithsakka/324/orig -> origin/gh/laithsakka/324/orig 2025-12-04T08:56:59.3179118Z * [new branch] gh/laithsakka/325/base -> origin/gh/laithsakka/325/base 2025-12-04T08:56:59.3181102Z * [new branch] gh/laithsakka/325/head -> origin/gh/laithsakka/325/head 2025-12-04T08:56:59.3182735Z * [new branch] gh/laithsakka/325/orig -> origin/gh/laithsakka/325/orig 2025-12-04T08:56:59.3185149Z * [new branch] gh/laithsakka/326/base -> origin/gh/laithsakka/326/base 2025-12-04T08:56:59.3186747Z * [new branch] gh/laithsakka/326/head -> origin/gh/laithsakka/326/head 2025-12-04T08:56:59.3188354Z * [new branch] gh/laithsakka/326/orig -> origin/gh/laithsakka/326/orig 2025-12-04T08:56:59.3190535Z * [new branch] gh/laithsakka/327/base -> origin/gh/laithsakka/327/base 2025-12-04T08:56:59.3192232Z * [new branch] gh/laithsakka/327/head -> origin/gh/laithsakka/327/head 2025-12-04T08:56:59.3193996Z * [new branch] gh/laithsakka/327/orig -> origin/gh/laithsakka/327/orig 2025-12-04T08:56:59.3196144Z * [new branch] gh/laithsakka/328/base -> origin/gh/laithsakka/328/base 2025-12-04T08:56:59.3198169Z * [new branch] gh/laithsakka/328/head -> origin/gh/laithsakka/328/head 2025-12-04T08:56:59.3199753Z * [new branch] gh/laithsakka/328/orig -> origin/gh/laithsakka/328/orig 2025-12-04T08:56:59.3202192Z * [new branch] gh/liangel/4/base -> origin/gh/liangel/4/base 2025-12-04T08:56:59.3203796Z * [new branch] gh/liangel/4/head -> origin/gh/liangel/4/head 2025-12-04T08:56:59.3205684Z * [new branch] gh/liangel/4/orig -> origin/gh/liangel/4/orig 2025-12-04T08:56:59.3209646Z * [new branch] gh/lucaskabela/1/base -> origin/gh/lucaskabela/1/base 2025-12-04T08:56:59.3211256Z * [new branch] gh/lucaskabela/1/head -> origin/gh/lucaskabela/1/head 2025-12-04T08:56:59.3213840Z * [new branch] gh/lw/4/base -> origin/gh/lw/4/base 2025-12-04T08:56:59.3215393Z * [new branch] gh/lw/4/head -> origin/gh/lw/4/head 2025-12-04T08:56:59.3216965Z * [new branch] gh/lw/4/orig -> origin/gh/lw/4/orig 2025-12-04T08:56:59.3218894Z * [new branch] gh/lw/5/base -> origin/gh/lw/5/base 2025-12-04T08:56:59.3220465Z * [new branch] gh/lw/5/head -> origin/gh/lw/5/head 2025-12-04T08:56:59.3221976Z * [new branch] gh/lw/5/orig -> origin/gh/lw/5/orig 2025-12-04T08:56:59.3224062Z * [new branch] gh/lw/6/base -> origin/gh/lw/6/base 2025-12-04T08:56:59.3225633Z * [new branch] gh/lw/6/head -> origin/gh/lw/6/head 2025-12-04T08:56:59.3227130Z * [new branch] gh/lw/6/orig -> origin/gh/lw/6/orig 2025-12-04T08:56:59.3229828Z * [new branch] gh/malfet/14/base -> origin/gh/malfet/14/base 2025-12-04T08:56:59.3231922Z * [new branch] gh/malfet/417/base -> origin/gh/malfet/417/base 2025-12-04T08:56:59.3233228Z * [new branch] gh/malfet/417/head -> origin/gh/malfet/417/head 2025-12-04T08:56:59.3234889Z * [new branch] gh/malfet/417/orig -> origin/gh/malfet/417/orig 2025-12-04T08:56:59.3237319Z * [new branch] gh/malfet/506/base -> origin/gh/malfet/506/base 2025-12-04T08:56:59.3238989Z * [new branch] gh/malfet/506/head -> origin/gh/malfet/506/head 2025-12-04T08:56:59.3240406Z * [new branch] gh/malfet/506/orig -> origin/gh/malfet/506/orig 2025-12-04T08:56:59.3242763Z * [new branch] gh/malfet/517/base -> origin/gh/malfet/517/base 2025-12-04T08:56:59.3244360Z * [new branch] gh/malfet/517/head -> origin/gh/malfet/517/head 2025-12-04T08:56:59.3246467Z * [new branch] gh/malfet/528/base -> origin/gh/malfet/528/base 2025-12-04T08:56:59.3248011Z * [new branch] gh/malfet/528/head -> origin/gh/malfet/528/head 2025-12-04T08:56:59.3250073Z * [new branch] gh/malfet/528/orig -> origin/gh/malfet/528/orig 2025-12-04T08:56:59.3252166Z * [new branch] gh/malfet/537/base -> origin/gh/malfet/537/base 2025-12-04T08:56:59.3253880Z * [new branch] gh/malfet/537/head -> origin/gh/malfet/537/head 2025-12-04T08:56:59.3255419Z * [new branch] gh/malfet/537/orig -> origin/gh/malfet/537/orig 2025-12-04T08:56:59.3257558Z * [new branch] gh/malfet/546/base -> origin/gh/malfet/546/base 2025-12-04T08:56:59.3259103Z * [new branch] gh/malfet/546/head -> origin/gh/malfet/546/head 2025-12-04T08:56:59.3260721Z * [new branch] gh/malfet/546/orig -> origin/gh/malfet/546/orig 2025-12-04T08:56:59.3264415Z * [new branch] gh/malfet/565/base -> origin/gh/malfet/565/base 2025-12-04T08:56:59.3266125Z * [new branch] gh/malfet/565/head -> origin/gh/malfet/565/head 2025-12-04T08:56:59.3267573Z * [new branch] gh/malfet/565/orig -> origin/gh/malfet/565/orig 2025-12-04T08:56:59.3269701Z * [new branch] gh/malfet/575/base -> origin/gh/malfet/575/base 2025-12-04T08:56:59.3271235Z * [new branch] gh/malfet/575/head -> origin/gh/malfet/575/head 2025-12-04T08:56:59.3272860Z * [new branch] gh/malfet/575/orig -> origin/gh/malfet/575/orig 2025-12-04T08:56:59.3274919Z * [new branch] gh/malfet/580/base -> origin/gh/malfet/580/base 2025-12-04T08:56:59.3276478Z * [new branch] gh/malfet/580/head -> origin/gh/malfet/580/head 2025-12-04T08:56:59.3278028Z * [new branch] gh/malfet/580/orig -> origin/gh/malfet/580/orig 2025-12-04T08:56:59.3279975Z * [new branch] gh/malfet/581/base -> origin/gh/malfet/581/base 2025-12-04T08:56:59.3281570Z * [new branch] gh/malfet/581/head -> origin/gh/malfet/581/head 2025-12-04T08:56:59.3283089Z * [new branch] gh/malfet/581/orig -> origin/gh/malfet/581/orig 2025-12-04T08:56:59.3285050Z * [new branch] gh/malfet/583/base -> origin/gh/malfet/583/base 2025-12-04T08:56:59.3286710Z * [new branch] gh/malfet/583/head -> origin/gh/malfet/583/head 2025-12-04T08:56:59.3288217Z * [new branch] gh/malfet/583/orig -> origin/gh/malfet/583/orig 2025-12-04T08:56:59.3290225Z * [new branch] gh/malfet/586/base -> origin/gh/malfet/586/base 2025-12-04T08:56:59.3291835Z * [new branch] gh/malfet/586/head -> origin/gh/malfet/586/head 2025-12-04T08:56:59.3293728Z * [new branch] gh/malfet/586/orig -> origin/gh/malfet/586/orig 2025-12-04T08:56:59.3295651Z * [new branch] gh/malfet/587/base -> origin/gh/malfet/587/base 2025-12-04T08:56:59.3297187Z * [new branch] gh/malfet/587/head -> origin/gh/malfet/587/head 2025-12-04T08:56:59.3298718Z * [new branch] gh/malfet/587/orig -> origin/gh/malfet/587/orig 2025-12-04T08:56:59.3300723Z * [new branch] gh/malfet/588/base -> origin/gh/malfet/588/base 2025-12-04T08:56:59.3302322Z * [new branch] gh/malfet/588/head -> origin/gh/malfet/588/head 2025-12-04T08:56:59.3304006Z * [new branch] gh/malfet/588/orig -> origin/gh/malfet/588/orig 2025-12-04T08:56:59.3305976Z * [new branch] gh/malfet/589/base -> origin/gh/malfet/589/base 2025-12-04T08:56:59.3307507Z * [new branch] gh/malfet/589/head -> origin/gh/malfet/589/head 2025-12-04T08:56:59.3309175Z * [new branch] gh/malfet/589/orig -> origin/gh/malfet/589/orig 2025-12-04T08:56:59.3311151Z * [new branch] gh/malfet/590/base -> origin/gh/malfet/590/base 2025-12-04T08:56:59.3312732Z * [new branch] gh/malfet/590/head -> origin/gh/malfet/590/head 2025-12-04T08:56:59.3314357Z * [new branch] gh/malfet/590/orig -> origin/gh/malfet/590/orig 2025-12-04T08:56:59.3316957Z * [new branch] gh/malfet/591/base -> origin/gh/malfet/591/base 2025-12-04T08:56:59.3318527Z * [new branch] gh/malfet/591/head -> origin/gh/malfet/591/head 2025-12-04T08:56:59.3320092Z * [new branch] gh/malfet/591/orig -> origin/gh/malfet/591/orig 2025-12-04T08:56:59.3322182Z * [new branch] gh/malfet/592/base -> origin/gh/malfet/592/base 2025-12-04T08:56:59.3323831Z * [new branch] gh/malfet/592/head -> origin/gh/malfet/592/head 2025-12-04T08:56:59.3325395Z * [new branch] gh/malfet/592/orig -> origin/gh/malfet/592/orig 2025-12-04T08:56:59.3327466Z * [new branch] gh/malfet/593/base -> origin/gh/malfet/593/base 2025-12-04T08:56:59.3329067Z * [new branch] gh/malfet/593/head -> origin/gh/malfet/593/head 2025-12-04T08:56:59.3330638Z * [new branch] gh/malfet/593/orig -> origin/gh/malfet/593/orig 2025-12-04T08:56:59.3332811Z * [new branch] gh/malfet/594/base -> origin/gh/malfet/594/base 2025-12-04T08:56:59.3334368Z * [new branch] gh/malfet/594/head -> origin/gh/malfet/594/head 2025-12-04T08:56:59.3335993Z * [new branch] gh/malfet/594/orig -> origin/gh/malfet/594/orig 2025-12-04T08:56:59.3338112Z * [new branch] gh/malfet/595/base -> origin/gh/malfet/595/base 2025-12-04T08:56:59.3339533Z * [new branch] gh/malfet/595/head -> origin/gh/malfet/595/head 2025-12-04T08:56:59.3341214Z * [new branch] gh/malfet/595/orig -> origin/gh/malfet/595/orig 2025-12-04T08:56:59.3343310Z * [new branch] gh/malfet/596/base -> origin/gh/malfet/596/base 2025-12-04T08:56:59.3344924Z * [new branch] gh/malfet/596/head -> origin/gh/malfet/596/head 2025-12-04T08:56:59.3346355Z * [new branch] gh/malfet/596/orig -> origin/gh/malfet/596/orig 2025-12-04T08:56:59.3348473Z * [new branch] gh/malfet/597/base -> origin/gh/malfet/597/base 2025-12-04T08:56:59.3350082Z * [new branch] gh/malfet/597/head -> origin/gh/malfet/597/head 2025-12-04T08:56:59.3351620Z * [new branch] gh/malfet/597/orig -> origin/gh/malfet/597/orig 2025-12-04T08:56:59.3353651Z * [new branch] gh/malfet/598/base -> origin/gh/malfet/598/base 2025-12-04T08:56:59.3355236Z * [new branch] gh/malfet/598/head -> origin/gh/malfet/598/head 2025-12-04T08:56:59.3356971Z * [new branch] gh/malfet/598/orig -> origin/gh/malfet/598/orig 2025-12-04T08:56:59.3359032Z * [new branch] gh/malfet/599/base -> origin/gh/malfet/599/base 2025-12-04T08:56:59.3360595Z * [new branch] gh/malfet/599/head -> origin/gh/malfet/599/head 2025-12-04T08:56:59.3362139Z * [new branch] gh/malfet/599/orig -> origin/gh/malfet/599/orig 2025-12-04T08:56:59.3364401Z * [new branch] gh/malfet/600/base -> origin/gh/malfet/600/base 2025-12-04T08:56:59.3366077Z * [new branch] gh/malfet/600/head -> origin/gh/malfet/600/head 2025-12-04T08:56:59.3367483Z * [new branch] gh/malfet/600/orig -> origin/gh/malfet/600/orig 2025-12-04T08:56:59.3369670Z * [new branch] gh/malfet/601/base -> origin/gh/malfet/601/base 2025-12-04T08:56:59.3371372Z * [new branch] gh/malfet/601/head -> origin/gh/malfet/601/head 2025-12-04T08:56:59.3373003Z * [new branch] gh/malfet/601/orig -> origin/gh/malfet/601/orig 2025-12-04T08:56:59.3375050Z * [new branch] gh/malfet/602/base -> origin/gh/malfet/602/base 2025-12-04T08:56:59.3376622Z * [new branch] gh/malfet/602/head -> origin/gh/malfet/602/head 2025-12-04T08:56:59.3378208Z * [new branch] gh/malfet/602/orig -> origin/gh/malfet/602/orig 2025-12-04T08:56:59.3380236Z * [new branch] gh/malfet/603/base -> origin/gh/malfet/603/base 2025-12-04T08:56:59.3381693Z * [new branch] gh/malfet/603/head -> origin/gh/malfet/603/head 2025-12-04T08:56:59.3383223Z * [new branch] gh/malfet/603/orig -> origin/gh/malfet/603/orig 2025-12-04T08:56:59.3385345Z * [new branch] gh/malfet/604/base -> origin/gh/malfet/604/base 2025-12-04T08:56:59.3386864Z * [new branch] gh/malfet/604/head -> origin/gh/malfet/604/head 2025-12-04T08:56:59.3388475Z * [new branch] gh/malfet/604/orig -> origin/gh/malfet/604/orig 2025-12-04T08:56:59.3390662Z * [new branch] gh/malfet/605/base -> origin/gh/malfet/605/base 2025-12-04T08:56:59.3392160Z * [new branch] gh/malfet/605/head -> origin/gh/malfet/605/head 2025-12-04T08:56:59.3393779Z * [new branch] gh/malfet/605/orig -> origin/gh/malfet/605/orig 2025-12-04T08:56:59.3396401Z * [new branch] gh/malfet/606/base -> origin/gh/malfet/606/base 2025-12-04T08:56:59.3398001Z * [new branch] gh/malfet/606/head -> origin/gh/malfet/606/head 2025-12-04T08:56:59.3399656Z * [new branch] gh/malfet/606/orig -> origin/gh/malfet/606/orig 2025-12-04T08:56:59.3401842Z * [new branch] gh/malfet/607/base -> origin/gh/malfet/607/base 2025-12-04T08:56:59.3403420Z * [new branch] gh/malfet/607/head -> origin/gh/malfet/607/head 2025-12-04T08:56:59.3404991Z * [new branch] gh/malfet/607/orig -> origin/gh/malfet/607/orig 2025-12-04T08:56:59.3407121Z * [new branch] gh/malfet/608/base -> origin/gh/malfet/608/base 2025-12-04T08:56:59.3409067Z * [new branch] gh/malfet/608/head -> origin/gh/malfet/608/head 2025-12-04T08:56:59.3410640Z * [new branch] gh/malfet/608/orig -> origin/gh/malfet/608/orig 2025-12-04T08:56:59.3412982Z * [new branch] gh/malfet/609/base -> origin/gh/malfet/609/base 2025-12-04T08:56:59.3414351Z * [new branch] gh/malfet/609/head -> origin/gh/malfet/609/head 2025-12-04T08:56:59.3416061Z * [new branch] gh/malfet/609/orig -> origin/gh/malfet/609/orig 2025-12-04T08:56:59.3418188Z * [new branch] gh/malfet/610/base -> origin/gh/malfet/610/base 2025-12-04T08:56:59.3419881Z * [new branch] gh/malfet/610/head -> origin/gh/malfet/610/head 2025-12-04T08:56:59.3421299Z * [new branch] gh/malfet/610/orig -> origin/gh/malfet/610/orig 2025-12-04T08:56:59.3423514Z * [new branch] gh/malfet/611/base -> origin/gh/malfet/611/base 2025-12-04T08:56:59.3425206Z * [new branch] gh/malfet/611/head -> origin/gh/malfet/611/head 2025-12-04T08:56:59.3426737Z * [new branch] gh/malfet/611/orig -> origin/gh/malfet/611/orig 2025-12-04T08:56:59.3428778Z * [new branch] gh/malfet/612/base -> origin/gh/malfet/612/base 2025-12-04T08:56:59.3430299Z * [new branch] gh/malfet/612/head -> origin/gh/malfet/612/head 2025-12-04T08:56:59.3431885Z * [new branch] gh/malfet/612/orig -> origin/gh/malfet/612/orig 2025-12-04T08:56:59.3434097Z * [new branch] gh/malfet/64/base -> origin/gh/malfet/64/base 2025-12-04T08:56:59.3435662Z * [new branch] gh/malfet/64/head -> origin/gh/malfet/64/head 2025-12-04T08:56:59.3438179Z * [new branch] gh/manuelcandales/11/base -> origin/gh/manuelcandales/11/base 2025-12-04T08:56:59.3439708Z * [new branch] gh/manuelcandales/11/head -> origin/gh/manuelcandales/11/head 2025-12-04T08:56:59.3441388Z * [new branch] gh/manuelcandales/11/orig -> origin/gh/manuelcandales/11/orig 2025-12-04T08:56:59.3444031Z * [new branch] gh/markkm/1/base -> origin/gh/markkm/1/base 2025-12-04T08:56:59.3446572Z * [new branch] gh/masnesral/1/base -> origin/gh/masnesral/1/base 2025-12-04T08:56:59.3448128Z * [new branch] gh/masnesral/1/head -> origin/gh/masnesral/1/head 2025-12-04T08:56:59.3449691Z * [new branch] gh/masnesral/1/orig -> origin/gh/masnesral/1/orig 2025-12-04T08:56:59.3452381Z * [new branch] gh/mhorowitz/0/base -> origin/gh/mhorowitz/0/base 2025-12-04T08:56:59.3453972Z * [new branch] gh/mhorowitz/0/head -> origin/gh/mhorowitz/0/head 2025-12-04T08:56:59.3455920Z * [new branch] gh/mhorowitz/1/base -> origin/gh/mhorowitz/1/base 2025-12-04T08:56:59.3457487Z * [new branch] gh/mhorowitz/1/head -> origin/gh/mhorowitz/1/head 2025-12-04T08:56:59.3459538Z * [new branch] gh/mhorowitz/2/base -> origin/gh/mhorowitz/2/base 2025-12-04T08:56:59.3461050Z * [new branch] gh/mhorowitz/2/head -> origin/gh/mhorowitz/2/head 2025-12-04T08:56:59.3463263Z * [new branch] gh/mhorowitz/3/base -> origin/gh/mhorowitz/3/base 2025-12-04T08:56:59.3464795Z * [new branch] gh/mhorowitz/3/head -> origin/gh/mhorowitz/3/head 2025-12-04T08:56:59.3466727Z * [new branch] gh/mhorowitz/4/base -> origin/gh/mhorowitz/4/base 2025-12-04T08:56:59.3468201Z * [new branch] gh/mhorowitz/4/head -> origin/gh/mhorowitz/4/head 2025-12-04T08:56:59.3470140Z * [new branch] gh/mhorowitz/5/base -> origin/gh/mhorowitz/5/base 2025-12-04T08:56:59.3471650Z * [new branch] gh/mhorowitz/5/head -> origin/gh/mhorowitz/5/head 2025-12-04T08:56:59.3473618Z * [new branch] gh/mhorowitz/6/base -> origin/gh/mhorowitz/6/base 2025-12-04T08:56:59.3475097Z * [new branch] gh/mhorowitz/6/head -> origin/gh/mhorowitz/6/head 2025-12-04T08:56:59.3477721Z * [new branch] gh/mikaylagawarecki/234/base -> origin/gh/mikaylagawarecki/234/base 2025-12-04T08:56:59.3479363Z * [new branch] gh/mikaylagawarecki/234/head -> origin/gh/mikaylagawarecki/234/head 2025-12-04T08:56:59.3481412Z * [new branch] gh/mikaylagawarecki/235/base -> origin/gh/mikaylagawarecki/235/base 2025-12-04T08:56:59.3482940Z * [new branch] gh/mikaylagawarecki/235/head -> origin/gh/mikaylagawarecki/235/head 2025-12-04T08:56:59.3485064Z * [new branch] gh/mikaylagawarecki/236/base -> origin/gh/mikaylagawarecki/236/base 2025-12-04T08:56:59.3486320Z * [new branch] gh/mikaylagawarecki/236/head -> origin/gh/mikaylagawarecki/236/head 2025-12-04T08:56:59.3488438Z * [new branch] gh/mikaylagawarecki/237/base -> origin/gh/mikaylagawarecki/237/base 2025-12-04T08:56:59.3489901Z * [new branch] gh/mikaylagawarecki/237/head -> origin/gh/mikaylagawarecki/237/head 2025-12-04T08:56:59.3492036Z * [new branch] gh/mikaylagawarecki/238/base -> origin/gh/mikaylagawarecki/238/base 2025-12-04T08:56:59.3493768Z * [new branch] gh/mikaylagawarecki/238/head -> origin/gh/mikaylagawarecki/238/head 2025-12-04T08:56:59.3495994Z * [new branch] gh/mikaylagawarecki/336/base -> origin/gh/mikaylagawarecki/336/base 2025-12-04T08:56:59.3497550Z * [new branch] gh/mikaylagawarecki/336/head -> origin/gh/mikaylagawarecki/336/head 2025-12-04T08:56:59.3499104Z * [new branch] gh/mikaylagawarecki/336/orig -> origin/gh/mikaylagawarecki/336/orig 2025-12-04T08:56:59.3501361Z * [new branch] gh/mikaylagawarecki/341/base -> origin/gh/mikaylagawarecki/341/base 2025-12-04T08:56:59.3502878Z * [new branch] gh/mikaylagawarecki/341/head -> origin/gh/mikaylagawarecki/341/head 2025-12-04T08:56:59.3504270Z * [new branch] gh/mikaylagawarecki/341/orig -> origin/gh/mikaylagawarecki/341/orig 2025-12-04T08:56:59.3506623Z * [new branch] gh/mikaylagawarecki/342/base -> origin/gh/mikaylagawarecki/342/base 2025-12-04T08:56:59.3508039Z * [new branch] gh/mikaylagawarecki/342/head -> origin/gh/mikaylagawarecki/342/head 2025-12-04T08:56:59.3509607Z * [new branch] gh/mikaylagawarecki/342/orig -> origin/gh/mikaylagawarecki/342/orig 2025-12-04T08:56:59.3511880Z * [new branch] gh/mikaylagawarecki/345/base -> origin/gh/mikaylagawarecki/345/base 2025-12-04T08:56:59.3513379Z * [new branch] gh/mikaylagawarecki/345/head -> origin/gh/mikaylagawarecki/345/head 2025-12-04T08:56:59.3514945Z * [new branch] gh/mikaylagawarecki/345/orig -> origin/gh/mikaylagawarecki/345/orig 2025-12-04T08:56:59.3517242Z * [new branch] gh/mikaylagawarecki/346/base -> origin/gh/mikaylagawarecki/346/base 2025-12-04T08:56:59.3518778Z * [new branch] gh/mikaylagawarecki/346/head -> origin/gh/mikaylagawarecki/346/head 2025-12-04T08:56:59.3520369Z * [new branch] gh/mikaylagawarecki/346/orig -> origin/gh/mikaylagawarecki/346/orig 2025-12-04T08:56:59.3522512Z * [new branch] gh/mikaylagawarecki/347/base -> origin/gh/mikaylagawarecki/347/base 2025-12-04T08:56:59.3524007Z * [new branch] gh/mikaylagawarecki/347/head -> origin/gh/mikaylagawarecki/347/head 2025-12-04T08:56:59.3525547Z * [new branch] gh/mikaylagawarecki/347/orig -> origin/gh/mikaylagawarecki/347/orig 2025-12-04T08:56:59.3527677Z * [new branch] gh/mikaylagawarecki/350/base -> origin/gh/mikaylagawarecki/350/base 2025-12-04T08:56:59.3529267Z * [new branch] gh/mikaylagawarecki/350/head -> origin/gh/mikaylagawarecki/350/head 2025-12-04T08:56:59.3530787Z * [new branch] gh/mikaylagawarecki/350/orig -> origin/gh/mikaylagawarecki/350/orig 2025-12-04T08:56:59.3533444Z * [new branch] gh/mikaylagawarecki/351/base -> origin/gh/mikaylagawarecki/351/base 2025-12-04T08:56:59.3535043Z * [new branch] gh/mikaylagawarecki/351/head -> origin/gh/mikaylagawarecki/351/head 2025-12-04T08:56:59.3536624Z * [new branch] gh/mikaylagawarecki/351/orig -> origin/gh/mikaylagawarecki/351/orig 2025-12-04T08:56:59.3538933Z * [new branch] gh/mikaylagawarecki/352/base -> origin/gh/mikaylagawarecki/352/base 2025-12-04T08:56:59.3540694Z * [new branch] gh/mikaylagawarecki/352/head -> origin/gh/mikaylagawarecki/352/head 2025-12-04T08:56:59.3542210Z * [new branch] gh/mikaylagawarecki/352/orig -> origin/gh/mikaylagawarecki/352/orig 2025-12-04T08:56:59.3544556Z * [new branch] gh/mikaylagawarecki/353/base -> origin/gh/mikaylagawarecki/353/base 2025-12-04T08:56:59.3546255Z * [new branch] gh/mikaylagawarecki/353/head -> origin/gh/mikaylagawarecki/353/head 2025-12-04T08:56:59.3547746Z * [new branch] gh/mikaylagawarecki/353/orig -> origin/gh/mikaylagawarecki/353/orig 2025-12-04T08:56:59.3549696Z * [new branch] gh/mikaylagawarecki/354/base -> origin/gh/mikaylagawarecki/354/base 2025-12-04T08:56:59.3551777Z * [new branch] gh/mikaylagawarecki/354/head -> origin/gh/mikaylagawarecki/354/head 2025-12-04T08:56:59.3553355Z * [new branch] gh/mikaylagawarecki/354/orig -> origin/gh/mikaylagawarecki/354/orig 2025-12-04T08:56:59.3555952Z * [new branch] gh/mikaylagawarecki/356/base -> origin/gh/mikaylagawarecki/356/base 2025-12-04T08:56:59.3558004Z * [new branch] gh/mikaylagawarecki/356/head -> origin/gh/mikaylagawarecki/356/head 2025-12-04T08:56:59.3560009Z * [new branch] gh/mikaylagawarecki/356/orig -> origin/gh/mikaylagawarecki/356/orig 2025-12-04T08:56:59.3562344Z * [new branch] gh/mikaylagawarecki/357/base -> origin/gh/mikaylagawarecki/357/base 2025-12-04T08:56:59.3563922Z * [new branch] gh/mikaylagawarecki/357/head -> origin/gh/mikaylagawarecki/357/head 2025-12-04T08:56:59.3565493Z * [new branch] gh/mikaylagawarecki/357/orig -> origin/gh/mikaylagawarecki/357/orig 2025-12-04T08:56:59.3567731Z * [new branch] gh/mikaylagawarecki/359/base -> origin/gh/mikaylagawarecki/359/base 2025-12-04T08:56:59.3569345Z * [new branch] gh/mikaylagawarecki/359/head -> origin/gh/mikaylagawarecki/359/head 2025-12-04T08:56:59.3570891Z * [new branch] gh/mikaylagawarecki/359/orig -> origin/gh/mikaylagawarecki/359/orig 2025-12-04T08:56:59.3573260Z * [new branch] gh/mikaylagawarecki/360/base -> origin/gh/mikaylagawarecki/360/base 2025-12-04T08:56:59.3574842Z * [new branch] gh/mikaylagawarecki/360/head -> origin/gh/mikaylagawarecki/360/head 2025-12-04T08:56:59.3576465Z * [new branch] gh/mikaylagawarecki/360/orig -> origin/gh/mikaylagawarecki/360/orig 2025-12-04T08:56:59.3578641Z * [new branch] gh/mikaylagawarecki/361/base -> origin/gh/mikaylagawarecki/361/base 2025-12-04T08:56:59.3580198Z * [new branch] gh/mikaylagawarecki/361/head -> origin/gh/mikaylagawarecki/361/head 2025-12-04T08:56:59.3581759Z * [new branch] gh/mikaylagawarecki/361/orig -> origin/gh/mikaylagawarecki/361/orig 2025-12-04T08:56:59.3584009Z * [new branch] gh/mikaylagawarecki/362/base -> origin/gh/mikaylagawarecki/362/base 2025-12-04T08:56:59.3585718Z * [new branch] gh/mikaylagawarecki/362/head -> origin/gh/mikaylagawarecki/362/head 2025-12-04T08:56:59.3587240Z * [new branch] gh/mikaylagawarecki/362/orig -> origin/gh/mikaylagawarecki/362/orig 2025-12-04T08:56:59.3589693Z * [new branch] gh/mikaylagawarecki/363/base -> origin/gh/mikaylagawarecki/363/base 2025-12-04T08:56:59.3591376Z * [new branch] gh/mikaylagawarecki/363/head -> origin/gh/mikaylagawarecki/363/head 2025-12-04T08:56:59.3593022Z * [new branch] gh/mikaylagawarecki/363/orig -> origin/gh/mikaylagawarecki/363/orig 2025-12-04T08:56:59.3595480Z * [new branch] gh/mikaylagawarecki/364/base -> origin/gh/mikaylagawarecki/364/base 2025-12-04T08:56:59.3596974Z * [new branch] gh/mikaylagawarecki/364/head -> origin/gh/mikaylagawarecki/364/head 2025-12-04T08:56:59.3598592Z * [new branch] gh/mikaylagawarecki/364/orig -> origin/gh/mikaylagawarecki/364/orig 2025-12-04T08:56:59.3600879Z * [new branch] gh/mikaylagawarecki/365/base -> origin/gh/mikaylagawarecki/365/base 2025-12-04T08:56:59.3602441Z * [new branch] gh/mikaylagawarecki/365/head -> origin/gh/mikaylagawarecki/365/head 2025-12-04T08:56:59.3604224Z * [new branch] gh/mikaylagawarecki/365/orig -> origin/gh/mikaylagawarecki/365/orig 2025-12-04T08:56:59.3606366Z * [new branch] gh/mikaylagawarecki/366/base -> origin/gh/mikaylagawarecki/366/base 2025-12-04T08:56:59.3607829Z * [new branch] gh/mikaylagawarecki/366/head -> origin/gh/mikaylagawarecki/366/head 2025-12-04T08:56:59.3609565Z * [new branch] gh/mikaylagawarecki/366/orig -> origin/gh/mikaylagawarecki/366/orig 2025-12-04T08:56:59.3611702Z * [new branch] gh/mikaylagawarecki/367/base -> origin/gh/mikaylagawarecki/367/base 2025-12-04T08:56:59.3613470Z * [new branch] gh/mikaylagawarecki/367/head -> origin/gh/mikaylagawarecki/367/head 2025-12-04T08:56:59.3615112Z * [new branch] gh/mikaylagawarecki/367/orig -> origin/gh/mikaylagawarecki/367/orig 2025-12-04T08:56:59.3617284Z * [new branch] gh/mikaylagawarecki/368/base -> origin/gh/mikaylagawarecki/368/base 2025-12-04T08:56:59.3618872Z * [new branch] gh/mikaylagawarecki/368/head -> origin/gh/mikaylagawarecki/368/head 2025-12-04T08:56:59.3620450Z * [new branch] gh/mikaylagawarecki/368/orig -> origin/gh/mikaylagawarecki/368/orig 2025-12-04T08:56:59.3622630Z * [new branch] gh/mikaylagawarecki/369/base -> origin/gh/mikaylagawarecki/369/base 2025-12-04T08:56:59.3624231Z * [new branch] gh/mikaylagawarecki/369/head -> origin/gh/mikaylagawarecki/369/head 2025-12-04T08:56:59.3625761Z * [new branch] gh/mikaylagawarecki/369/orig -> origin/gh/mikaylagawarecki/369/orig 2025-12-04T08:56:59.3628074Z * [new branch] gh/mikaylagawarecki/370/base -> origin/gh/mikaylagawarecki/370/base 2025-12-04T08:56:59.3629627Z * [new branch] gh/mikaylagawarecki/370/head -> origin/gh/mikaylagawarecki/370/head 2025-12-04T08:56:59.3631153Z * [new branch] gh/mikaylagawarecki/370/orig -> origin/gh/mikaylagawarecki/370/orig 2025-12-04T08:56:59.3633304Z * [new branch] gh/mikaylagawarecki/371/base -> origin/gh/mikaylagawarecki/371/base 2025-12-04T08:56:59.3634810Z * [new branch] gh/mikaylagawarecki/371/head -> origin/gh/mikaylagawarecki/371/head 2025-12-04T08:56:59.3636402Z * [new branch] gh/mikaylagawarecki/371/orig -> origin/gh/mikaylagawarecki/371/orig 2025-12-04T08:56:59.3638699Z * [new branch] gh/mikaylagawarecki/372/base -> origin/gh/mikaylagawarecki/372/base 2025-12-04T08:56:59.3640291Z * [new branch] gh/mikaylagawarecki/372/head -> origin/gh/mikaylagawarecki/372/head 2025-12-04T08:56:59.3641825Z * [new branch] gh/mikaylagawarecki/372/orig -> origin/gh/mikaylagawarecki/372/orig 2025-12-04T08:56:59.3643880Z * [new branch] gh/mikaylagawarecki/373/base -> origin/gh/mikaylagawarecki/373/base 2025-12-04T08:56:59.3645400Z * [new branch] gh/mikaylagawarecki/373/head -> origin/gh/mikaylagawarecki/373/head 2025-12-04T08:56:59.3646911Z * [new branch] gh/mikaylagawarecki/373/orig -> origin/gh/mikaylagawarecki/373/orig 2025-12-04T08:56:59.3649124Z * [new branch] gh/mikaylagawarecki/374/base -> origin/gh/mikaylagawarecki/374/base 2025-12-04T08:56:59.3650808Z * [new branch] gh/mikaylagawarecki/374/head -> origin/gh/mikaylagawarecki/374/head 2025-12-04T08:56:59.3652393Z * [new branch] gh/mikaylagawarecki/374/orig -> origin/gh/mikaylagawarecki/374/orig 2025-12-04T08:56:59.3654659Z * [new branch] gh/mikaylagawarecki/375/base -> origin/gh/mikaylagawarecki/375/base 2025-12-04T08:56:59.3656277Z * [new branch] gh/mikaylagawarecki/375/head -> origin/gh/mikaylagawarecki/375/head 2025-12-04T08:56:59.3657825Z * [new branch] gh/mikaylagawarecki/375/orig -> origin/gh/mikaylagawarecki/375/orig 2025-12-04T08:56:59.3659951Z * [new branch] gh/mikaylagawarecki/376/base -> origin/gh/mikaylagawarecki/376/base 2025-12-04T08:56:59.3661579Z * [new branch] gh/mikaylagawarecki/376/head -> origin/gh/mikaylagawarecki/376/head 2025-12-04T08:56:59.3664717Z * [new branch] gh/mikaylagawarecki/376/orig -> origin/gh/mikaylagawarecki/376/orig 2025-12-04T08:56:59.3666739Z * [new branch] gh/mikaylagawarecki/377/base -> origin/gh/mikaylagawarecki/377/base 2025-12-04T08:56:59.3668319Z * [new branch] gh/mikaylagawarecki/377/head -> origin/gh/mikaylagawarecki/377/head 2025-12-04T08:56:59.3669845Z * [new branch] gh/mikaylagawarecki/377/orig -> origin/gh/mikaylagawarecki/377/orig 2025-12-04T08:56:59.3672114Z * [new branch] gh/mikaylagawarecki/378/base -> origin/gh/mikaylagawarecki/378/base 2025-12-04T08:56:59.3673664Z * [new branch] gh/mikaylagawarecki/378/head -> origin/gh/mikaylagawarecki/378/head 2025-12-04T08:56:59.3675216Z * [new branch] gh/mikaylagawarecki/378/orig -> origin/gh/mikaylagawarecki/378/orig 2025-12-04T08:56:59.3677412Z * [new branch] gh/mikaylagawarecki/379/base -> origin/gh/mikaylagawarecki/379/base 2025-12-04T08:56:59.3678991Z * [new branch] gh/mikaylagawarecki/379/head -> origin/gh/mikaylagawarecki/379/head 2025-12-04T08:56:59.3680516Z * [new branch] gh/mikaylagawarecki/379/orig -> origin/gh/mikaylagawarecki/379/orig 2025-12-04T08:56:59.3682572Z * [new branch] gh/mikaylagawarecki/380/base -> origin/gh/mikaylagawarecki/380/base 2025-12-04T08:56:59.3684101Z * [new branch] gh/mikaylagawarecki/380/head -> origin/gh/mikaylagawarecki/380/head 2025-12-04T08:56:59.3685649Z * [new branch] gh/mikaylagawarecki/380/orig -> origin/gh/mikaylagawarecki/380/orig 2025-12-04T08:56:59.3687619Z * [new branch] gh/mikaylagawarecki/381/base -> origin/gh/mikaylagawarecki/381/base 2025-12-04T08:56:59.3689244Z * [new branch] gh/mikaylagawarecki/381/head -> origin/gh/mikaylagawarecki/381/head 2025-12-04T08:56:59.3690779Z * [new branch] gh/mikaylagawarecki/381/orig -> origin/gh/mikaylagawarecki/381/orig 2025-12-04T08:56:59.3693121Z * [new branch] gh/mikaylagawarecki/382/base -> origin/gh/mikaylagawarecki/382/base 2025-12-04T08:56:59.3694714Z * [new branch] gh/mikaylagawarecki/382/head -> origin/gh/mikaylagawarecki/382/head 2025-12-04T08:56:59.3696409Z * [new branch] gh/mikaylagawarecki/382/orig -> origin/gh/mikaylagawarecki/382/orig 2025-12-04T08:56:59.3698606Z * [new branch] gh/mikaylagawarecki/383/base -> origin/gh/mikaylagawarecki/383/base 2025-12-04T08:56:59.3700219Z * [new branch] gh/mikaylagawarecki/383/head -> origin/gh/mikaylagawarecki/383/head 2025-12-04T08:56:59.3701794Z * [new branch] gh/mikaylagawarecki/383/orig -> origin/gh/mikaylagawarecki/383/orig 2025-12-04T08:56:59.3703854Z * [new branch] gh/mikaylagawarecki/384/base -> origin/gh/mikaylagawarecki/384/base 2025-12-04T08:56:59.3705423Z * [new branch] gh/mikaylagawarecki/384/head -> origin/gh/mikaylagawarecki/384/head 2025-12-04T08:56:59.3706951Z * [new branch] gh/mikaylagawarecki/384/orig -> origin/gh/mikaylagawarecki/384/orig 2025-12-04T08:56:59.3709129Z * [new branch] gh/mikaylagawarecki/385/base -> origin/gh/mikaylagawarecki/385/base 2025-12-04T08:56:59.3710749Z * [new branch] gh/mikaylagawarecki/385/head -> origin/gh/mikaylagawarecki/385/head 2025-12-04T08:56:59.3712280Z * [new branch] gh/mikaylagawarecki/385/orig -> origin/gh/mikaylagawarecki/385/orig 2025-12-04T08:56:59.3714708Z * [new branch] gh/mikaylagawarecki/386/base -> origin/gh/mikaylagawarecki/386/base 2025-12-04T08:56:59.3716219Z * [new branch] gh/mikaylagawarecki/386/head -> origin/gh/mikaylagawarecki/386/head 2025-12-04T08:56:59.3717788Z * [new branch] gh/mikaylagawarecki/386/orig -> origin/gh/mikaylagawarecki/386/orig 2025-12-04T08:56:59.3719943Z * [new branch] gh/mikaylagawarecki/387/base -> origin/gh/mikaylagawarecki/387/base 2025-12-04T08:56:59.3721677Z * [new branch] gh/mikaylagawarecki/387/head -> origin/gh/mikaylagawarecki/387/head 2025-12-04T08:56:59.3723122Z * [new branch] gh/mikaylagawarecki/387/orig -> origin/gh/mikaylagawarecki/387/orig 2025-12-04T08:56:59.3725147Z * [new branch] gh/mikaylagawarecki/388/base -> origin/gh/mikaylagawarecki/388/base 2025-12-04T08:56:59.3726747Z * [new branch] gh/mikaylagawarecki/388/head -> origin/gh/mikaylagawarecki/388/head 2025-12-04T08:56:59.3728306Z * [new branch] gh/mikaylagawarecki/388/orig -> origin/gh/mikaylagawarecki/388/orig 2025-12-04T08:56:59.3730473Z * [new branch] gh/mikaylagawarecki/389/base -> origin/gh/mikaylagawarecki/389/base 2025-12-04T08:56:59.3731995Z * [new branch] gh/mikaylagawarecki/389/head -> origin/gh/mikaylagawarecki/389/head 2025-12-04T08:56:59.3733724Z * [new branch] gh/mikaylagawarecki/389/orig -> origin/gh/mikaylagawarecki/389/orig 2025-12-04T08:56:59.3736425Z * [new branch] gh/mikaylagawarecki/390/base -> origin/gh/mikaylagawarecki/390/base 2025-12-04T08:56:59.3737929Z * [new branch] gh/mikaylagawarecki/390/head -> origin/gh/mikaylagawarecki/390/head 2025-12-04T08:56:59.3739546Z * [new branch] gh/mikaylagawarecki/390/orig -> origin/gh/mikaylagawarecki/390/orig 2025-12-04T08:56:59.3741825Z * [new branch] gh/mikaylagawarecki/391/base -> origin/gh/mikaylagawarecki/391/base 2025-12-04T08:56:59.3743427Z * [new branch] gh/mikaylagawarecki/391/head -> origin/gh/mikaylagawarecki/391/head 2025-12-04T08:56:59.3744951Z * [new branch] gh/mikaylagawarecki/391/orig -> origin/gh/mikaylagawarecki/391/orig 2025-12-04T08:56:59.3747236Z * [new branch] gh/mikaylagawarecki/392/base -> origin/gh/mikaylagawarecki/392/base 2025-12-04T08:56:59.3748848Z * [new branch] gh/mikaylagawarecki/392/head -> origin/gh/mikaylagawarecki/392/head 2025-12-04T08:56:59.3750388Z * [new branch] gh/mikaylagawarecki/392/orig -> origin/gh/mikaylagawarecki/392/orig 2025-12-04T08:56:59.3752918Z * [new branch] gh/mlazos/41/base -> origin/gh/mlazos/41/base 2025-12-04T08:56:59.3754407Z * [new branch] gh/mlazos/41/head -> origin/gh/mlazos/41/head 2025-12-04T08:56:59.3755927Z * [new branch] gh/mlazos/41/orig -> origin/gh/mlazos/41/orig 2025-12-04T08:56:59.3758097Z * [new branch] gh/mlazos/42/base -> origin/gh/mlazos/42/base 2025-12-04T08:56:59.3759616Z * [new branch] gh/mlazos/42/head -> origin/gh/mlazos/42/head 2025-12-04T08:56:59.3761139Z * [new branch] gh/mlazos/42/orig -> origin/gh/mlazos/42/orig 2025-12-04T08:56:59.3763429Z * [new branch] gh/mlazos/43/base -> origin/gh/mlazos/43/base 2025-12-04T08:56:59.3764919Z * [new branch] gh/mlazos/43/head -> origin/gh/mlazos/43/head 2025-12-04T08:56:59.3766495Z * [new branch] gh/mlazos/43/orig -> origin/gh/mlazos/43/orig 2025-12-04T08:56:59.3768431Z * [new branch] gh/mlazos/44/base -> origin/gh/mlazos/44/base 2025-12-04T08:56:59.3770448Z * [new branch] gh/mlazos/44/head -> origin/gh/mlazos/44/head 2025-12-04T08:56:59.3772040Z * [new branch] gh/mlazos/44/orig -> origin/gh/mlazos/44/orig 2025-12-04T08:56:59.3774228Z * [new branch] gh/mlazos/47/base -> origin/gh/mlazos/47/base 2025-12-04T08:56:59.3775741Z * [new branch] gh/mlazos/47/head -> origin/gh/mlazos/47/head 2025-12-04T08:56:59.3777289Z * [new branch] gh/mlazos/47/orig -> origin/gh/mlazos/47/orig 2025-12-04T08:56:59.3779831Z * [new branch] gh/mlazos/48/base -> origin/gh/mlazos/48/base 2025-12-04T08:56:59.3781398Z * [new branch] gh/mlazos/48/head -> origin/gh/mlazos/48/head 2025-12-04T08:56:59.3783115Z * [new branch] gh/mlazos/48/orig -> origin/gh/mlazos/48/orig 2025-12-04T08:56:59.3785080Z * [new branch] gh/mlazos/49/base -> origin/gh/mlazos/49/base 2025-12-04T08:56:59.3786620Z * [new branch] gh/mlazos/49/head -> origin/gh/mlazos/49/head 2025-12-04T08:56:59.3788087Z * [new branch] gh/mlazos/49/orig -> origin/gh/mlazos/49/orig 2025-12-04T08:56:59.3790145Z * [new branch] gh/mlazos/50/base -> origin/gh/mlazos/50/base 2025-12-04T08:56:59.3791856Z * [new branch] gh/mlazos/50/head -> origin/gh/mlazos/50/head 2025-12-04T08:56:59.3793918Z * [new branch] gh/mlazos/50/orig -> origin/gh/mlazos/50/orig 2025-12-04T08:56:59.3796981Z * [new branch] gh/mlazos/51/base -> origin/gh/mlazos/51/base 2025-12-04T08:56:59.3798537Z * [new branch] gh/mlazos/51/head -> origin/gh/mlazos/51/head 2025-12-04T08:56:59.3800237Z * [new branch] gh/mlazos/51/orig -> origin/gh/mlazos/51/orig 2025-12-04T08:56:59.3802903Z * [new branch] gh/mlazos/52/base -> origin/gh/mlazos/52/base 2025-12-04T08:56:59.3804481Z * [new branch] gh/mlazos/52/head -> origin/gh/mlazos/52/head 2025-12-04T08:56:59.3806041Z * [new branch] gh/mlazos/52/orig -> origin/gh/mlazos/52/orig 2025-12-04T08:56:59.3808573Z * [new branch] gh/mlazos/53/base -> origin/gh/mlazos/53/base 2025-12-04T08:56:59.3810118Z * [new branch] gh/mlazos/53/head -> origin/gh/mlazos/53/head 2025-12-04T08:56:59.3811671Z * [new branch] gh/mlazos/53/orig -> origin/gh/mlazos/53/orig 2025-12-04T08:56:59.3813991Z * [new branch] gh/mlazos/54/base -> origin/gh/mlazos/54/base 2025-12-04T08:56:59.3815438Z * [new branch] gh/mlazos/54/head -> origin/gh/mlazos/54/head 2025-12-04T08:56:59.3816956Z * [new branch] gh/mlazos/54/orig -> origin/gh/mlazos/54/orig 2025-12-04T08:56:59.3818978Z * [new branch] gh/mlazos/55/base -> origin/gh/mlazos/55/base 2025-12-04T08:56:59.3820617Z * [new branch] gh/mlazos/55/head -> origin/gh/mlazos/55/head 2025-12-04T08:56:59.3822158Z * [new branch] gh/mlazos/55/orig -> origin/gh/mlazos/55/orig 2025-12-04T08:56:59.3824324Z * [new branch] gh/mlazos/56/base -> origin/gh/mlazos/56/base 2025-12-04T08:56:59.3825895Z * [new branch] gh/mlazos/56/head -> origin/gh/mlazos/56/head 2025-12-04T08:56:59.3827444Z * [new branch] gh/mlazos/56/orig -> origin/gh/mlazos/56/orig 2025-12-04T08:56:59.3829462Z * [new branch] gh/mlazos/57/base -> origin/gh/mlazos/57/base 2025-12-04T08:56:59.3830973Z * [new branch] gh/mlazos/57/head -> origin/gh/mlazos/57/head 2025-12-04T08:56:59.3832585Z * [new branch] gh/mlazos/57/orig -> origin/gh/mlazos/57/orig 2025-12-04T08:56:59.3834668Z * [new branch] gh/mlazos/58/base -> origin/gh/mlazos/58/base 2025-12-04T08:56:59.3836259Z * [new branch] gh/mlazos/58/head -> origin/gh/mlazos/58/head 2025-12-04T08:56:59.3837830Z * [new branch] gh/mlazos/58/orig -> origin/gh/mlazos/58/orig 2025-12-04T08:56:59.3839888Z * [new branch] gh/mlazos/59/base -> origin/gh/mlazos/59/base 2025-12-04T08:56:59.3841451Z * [new branch] gh/mlazos/59/head -> origin/gh/mlazos/59/head 2025-12-04T08:56:59.3842866Z * [new branch] gh/mlazos/59/orig -> origin/gh/mlazos/59/orig 2025-12-04T08:56:59.3845111Z * [new branch] gh/mlazos/60/base -> origin/gh/mlazos/60/base 2025-12-04T08:56:59.3846626Z * [new branch] gh/mlazos/60/head -> origin/gh/mlazos/60/head 2025-12-04T08:56:59.3848316Z * [new branch] gh/mlazos/60/orig -> origin/gh/mlazos/60/orig 2025-12-04T08:56:59.3850754Z * [new branch] gh/mlazos/61/base -> origin/gh/mlazos/61/base 2025-12-04T08:56:59.3852367Z * [new branch] gh/mlazos/61/head -> origin/gh/mlazos/61/head 2025-12-04T08:56:59.3853975Z * [new branch] gh/mlazos/61/orig -> origin/gh/mlazos/61/orig 2025-12-04T08:56:59.3856096Z * [new branch] gh/mlazos/62/base -> origin/gh/mlazos/62/base 2025-12-04T08:56:59.3857625Z * [new branch] gh/mlazos/62/head -> origin/gh/mlazos/62/head 2025-12-04T08:56:59.3859150Z * [new branch] gh/mlazos/62/orig -> origin/gh/mlazos/62/orig 2025-12-04T08:56:59.3861338Z * [new branch] gh/mlazos/63/base -> origin/gh/mlazos/63/base 2025-12-04T08:56:59.3863198Z * [new branch] gh/mlazos/63/head -> origin/gh/mlazos/63/head 2025-12-04T08:56:59.3864806Z * [new branch] gh/mlazos/63/orig -> origin/gh/mlazos/63/orig 2025-12-04T08:56:59.3867010Z * [new branch] gh/mlazos/64/base -> origin/gh/mlazos/64/base 2025-12-04T08:56:59.3868547Z * [new branch] gh/mlazos/64/head -> origin/gh/mlazos/64/head 2025-12-04T08:56:59.3870085Z * [new branch] gh/mlazos/64/orig -> origin/gh/mlazos/64/orig 2025-12-04T08:56:59.3872208Z * [new branch] gh/mlazos/65/base -> origin/gh/mlazos/65/base 2025-12-04T08:56:59.3873776Z * [new branch] gh/mlazos/65/head -> origin/gh/mlazos/65/head 2025-12-04T08:56:59.3875319Z * [new branch] gh/mlazos/65/orig -> origin/gh/mlazos/65/orig 2025-12-04T08:56:59.3877520Z * [new branch] gh/mlazos/66/base -> origin/gh/mlazos/66/base 2025-12-04T08:56:59.3879055Z * [new branch] gh/mlazos/66/head -> origin/gh/mlazos/66/head 2025-12-04T08:56:59.3880602Z * [new branch] gh/mlazos/66/orig -> origin/gh/mlazos/66/orig 2025-12-04T08:56:59.3882763Z * [new branch] gh/mlazos/67/base -> origin/gh/mlazos/67/base 2025-12-04T08:56:59.3884335Z * [new branch] gh/mlazos/67/head -> origin/gh/mlazos/67/head 2025-12-04T08:56:59.3885884Z * [new branch] gh/mlazos/67/orig -> origin/gh/mlazos/67/orig 2025-12-04T08:56:59.3888073Z * [new branch] gh/mlazos/68/base -> origin/gh/mlazos/68/base 2025-12-04T08:56:59.3889615Z * [new branch] gh/mlazos/68/head -> origin/gh/mlazos/68/head 2025-12-04T08:56:59.3891701Z * [new branch] gh/mlazos/68/orig -> origin/gh/mlazos/68/orig 2025-12-04T08:56:59.3893925Z * [new branch] gh/mlazos/69/base -> origin/gh/mlazos/69/base 2025-12-04T08:56:59.3895476Z * [new branch] gh/mlazos/69/head -> origin/gh/mlazos/69/head 2025-12-04T08:56:59.3897103Z * [new branch] gh/mlazos/69/orig -> origin/gh/mlazos/69/orig 2025-12-04T08:56:59.3899204Z * [new branch] gh/mlazos/70/base -> origin/gh/mlazos/70/base 2025-12-04T08:56:59.3900716Z * [new branch] gh/mlazos/70/head -> origin/gh/mlazos/70/head 2025-12-04T08:56:59.3902242Z * [new branch] gh/mlazos/70/orig -> origin/gh/mlazos/70/orig 2025-12-04T08:56:59.3904500Z * [new branch] gh/mlazos/71/base -> origin/gh/mlazos/71/base 2025-12-04T08:56:59.3906074Z * [new branch] gh/mlazos/71/head -> origin/gh/mlazos/71/head 2025-12-04T08:56:59.3907606Z * [new branch] gh/mlazos/71/orig -> origin/gh/mlazos/71/orig 2025-12-04T08:56:59.3909762Z * [new branch] gh/mlazos/72/base -> origin/gh/mlazos/72/base 2025-12-04T08:56:59.3911313Z * [new branch] gh/mlazos/72/head -> origin/gh/mlazos/72/head 2025-12-04T08:56:59.3914618Z * [new branch] gh/mlazos/72/orig -> origin/gh/mlazos/72/orig 2025-12-04T08:56:59.3915445Z * [new branch] gh/mlazos/73/base -> origin/gh/mlazos/73/base 2025-12-04T08:56:59.3916586Z * [new branch] gh/mlazos/73/head -> origin/gh/mlazos/73/head 2025-12-04T08:56:59.3918117Z * [new branch] gh/mlazos/73/orig -> origin/gh/mlazos/73/orig 2025-12-04T08:56:59.3920638Z * [new branch] gh/mrmiywj/1/base -> origin/gh/mrmiywj/1/base 2025-12-04T08:56:59.3922333Z * [new branch] gh/mrmiywj/1/head -> origin/gh/mrmiywj/1/head 2025-12-04T08:56:59.3924898Z * [new branch] gh/muchulee8/73/base -> origin/gh/muchulee8/73/base 2025-12-04T08:56:59.3926567Z * [new branch] gh/muchulee8/73/head -> origin/gh/muchulee8/73/head 2025-12-04T08:56:59.3928219Z * [new branch] gh/muchulee8/73/orig -> origin/gh/muchulee8/73/orig 2025-12-04T08:56:59.3930833Z * [new branch] gh/naveenthangudu/1/base -> origin/gh/naveenthangudu/1/base 2025-12-04T08:56:59.3932523Z * [new branch] gh/naveenthangudu/1/head -> origin/gh/naveenthangudu/1/head 2025-12-04T08:56:59.3934201Z * [new branch] gh/naveenthangudu/1/orig -> origin/gh/naveenthangudu/1/orig 2025-12-04T08:56:59.3936238Z * [new branch] gh/naveenthangudu/2/base -> origin/gh/naveenthangudu/2/base 2025-12-04T08:56:59.3937856Z * [new branch] gh/naveenthangudu/2/head -> origin/gh/naveenthangudu/2/head 2025-12-04T08:56:59.3939438Z * [new branch] gh/naveenthangudu/2/orig -> origin/gh/naveenthangudu/2/orig 2025-12-04T08:56:59.3941410Z * [new branch] gh/naveenthangudu/3/base -> origin/gh/naveenthangudu/3/base 2025-12-04T08:56:59.3942885Z * [new branch] gh/naveenthangudu/3/head -> origin/gh/naveenthangudu/3/head 2025-12-04T08:56:59.3944420Z * [new branch] gh/naveenthangudu/3/orig -> origin/gh/naveenthangudu/3/orig 2025-12-04T08:56:59.3946473Z * [new branch] gh/naveenthangudu/4/base -> origin/gh/naveenthangudu/4/base 2025-12-04T08:56:59.3948242Z * [new branch] gh/naveenthangudu/4/head -> origin/gh/naveenthangudu/4/head 2025-12-04T08:56:59.3949864Z * [new branch] gh/naveenthangudu/4/orig -> origin/gh/naveenthangudu/4/orig 2025-12-04T08:56:59.3951928Z * [new branch] gh/naveenthangudu/5/base -> origin/gh/naveenthangudu/5/base 2025-12-04T08:56:59.3953475Z * [new branch] gh/naveenthangudu/5/head -> origin/gh/naveenthangudu/5/head 2025-12-04T08:56:59.3955246Z * [new branch] gh/naveenthangudu/5/orig -> origin/gh/naveenthangudu/5/orig 2025-12-04T08:56:59.3957311Z * [new branch] gh/naveenthangudu/6/base -> origin/gh/naveenthangudu/6/base 2025-12-04T08:56:59.3958889Z * [new branch] gh/naveenthangudu/6/head -> origin/gh/naveenthangudu/6/head 2025-12-04T08:56:59.3960372Z * [new branch] gh/naveenthangudu/6/orig -> origin/gh/naveenthangudu/6/orig 2025-12-04T08:56:59.3963165Z * [new branch] gh/naveenthangudu/7/base -> origin/gh/naveenthangudu/7/base 2025-12-04T08:56:59.3964755Z * [new branch] gh/naveenthangudu/7/head -> origin/gh/naveenthangudu/7/head 2025-12-04T08:56:59.3966222Z * [new branch] gh/naveenthangudu/7/orig -> origin/gh/naveenthangudu/7/orig 2025-12-04T08:56:59.3968209Z * [new branch] gh/naveenthangudu/8/base -> origin/gh/naveenthangudu/8/base 2025-12-04T08:56:59.3969840Z * [new branch] gh/naveenthangudu/8/head -> origin/gh/naveenthangudu/8/head 2025-12-04T08:56:59.3971464Z * [new branch] gh/naveenthangudu/8/orig -> origin/gh/naveenthangudu/8/orig 2025-12-04T08:56:59.3973703Z * [new branch] gh/naveenthangudu/9/base -> origin/gh/naveenthangudu/9/base 2025-12-04T08:56:59.3975249Z * [new branch] gh/naveenthangudu/9/head -> origin/gh/naveenthangudu/9/head 2025-12-04T08:56:59.3976939Z * [new branch] gh/naveenthangudu/9/orig -> origin/gh/naveenthangudu/9/orig 2025-12-04T08:56:59.3979288Z * [new branch] gh/nikitaved/1/base -> origin/gh/nikitaved/1/base 2025-12-04T08:56:59.3980824Z * [new branch] gh/nikitaved/1/head -> origin/gh/nikitaved/1/head 2025-12-04T08:56:59.3982488Z * [new branch] gh/nikitaved/1/orig -> origin/gh/nikitaved/1/orig 2025-12-04T08:56:59.3984596Z * [new branch] gh/nikitaved/10/base -> origin/gh/nikitaved/10/base 2025-12-04T08:56:59.3986276Z * [new branch] gh/nikitaved/10/head -> origin/gh/nikitaved/10/head 2025-12-04T08:56:59.3987808Z * [new branch] gh/nikitaved/10/orig -> origin/gh/nikitaved/10/orig 2025-12-04T08:56:59.3989896Z * [new branch] gh/nikitaved/11/base -> origin/gh/nikitaved/11/base 2025-12-04T08:56:59.3991485Z * [new branch] gh/nikitaved/11/head -> origin/gh/nikitaved/11/head 2025-12-04T08:56:59.3992915Z * [new branch] gh/nikitaved/11/orig -> origin/gh/nikitaved/11/orig 2025-12-04T08:56:59.3994917Z * [new branch] gh/nikitaved/12/base -> origin/gh/nikitaved/12/base 2025-12-04T08:56:59.3996562Z * [new branch] gh/nikitaved/12/head -> origin/gh/nikitaved/12/head 2025-12-04T08:56:59.3997973Z * [new branch] gh/nikitaved/12/orig -> origin/gh/nikitaved/12/orig 2025-12-04T08:56:59.4000134Z * [new branch] gh/nikitaved/13/base -> origin/gh/nikitaved/13/base 2025-12-04T08:56:59.4001679Z * [new branch] gh/nikitaved/13/head -> origin/gh/nikitaved/13/head 2025-12-04T08:56:59.4003224Z * [new branch] gh/nikitaved/13/orig -> origin/gh/nikitaved/13/orig 2025-12-04T08:56:59.4005363Z * [new branch] gh/nikitaved/14/base -> origin/gh/nikitaved/14/base 2025-12-04T08:56:59.4006944Z * [new branch] gh/nikitaved/14/head -> origin/gh/nikitaved/14/head 2025-12-04T08:56:59.4008928Z * [new branch] gh/nikitaved/14/orig -> origin/gh/nikitaved/14/orig 2025-12-04T08:56:59.4011023Z * [new branch] gh/nikitaved/15/base -> origin/gh/nikitaved/15/base 2025-12-04T08:56:59.4012614Z * [new branch] gh/nikitaved/15/head -> origin/gh/nikitaved/15/head 2025-12-04T08:56:59.4014217Z * [new branch] gh/nikitaved/15/orig -> origin/gh/nikitaved/15/orig 2025-12-04T08:56:59.4016238Z * [new branch] gh/nikitaved/16/base -> origin/gh/nikitaved/16/base 2025-12-04T08:56:59.4017764Z * [new branch] gh/nikitaved/16/head -> origin/gh/nikitaved/16/head 2025-12-04T08:56:59.4019315Z * [new branch] gh/nikitaved/16/orig -> origin/gh/nikitaved/16/orig 2025-12-04T08:56:59.4021425Z * [new branch] gh/nikitaved/2/base -> origin/gh/nikitaved/2/base 2025-12-04T08:56:59.4022991Z * [new branch] gh/nikitaved/2/head -> origin/gh/nikitaved/2/head 2025-12-04T08:56:59.4024544Z * [new branch] gh/nikitaved/2/orig -> origin/gh/nikitaved/2/orig 2025-12-04T08:56:59.4026783Z * [new branch] gh/nikitaved/4/base -> origin/gh/nikitaved/4/base 2025-12-04T08:56:59.4028423Z * [new branch] gh/nikitaved/4/head -> origin/gh/nikitaved/4/head 2025-12-04T08:56:59.4029948Z * [new branch] gh/nikitaved/4/orig -> origin/gh/nikitaved/4/orig 2025-12-04T08:56:59.4032159Z * [new branch] gh/nikitaved/5/base -> origin/gh/nikitaved/5/base 2025-12-04T08:56:59.4033725Z * [new branch] gh/nikitaved/5/head -> origin/gh/nikitaved/5/head 2025-12-04T08:56:59.4035300Z * [new branch] gh/nikitaved/5/orig -> origin/gh/nikitaved/5/orig 2025-12-04T08:56:59.4037346Z * [new branch] gh/nikitaved/6/base -> origin/gh/nikitaved/6/base 2025-12-04T08:56:59.4038994Z * [new branch] gh/nikitaved/6/head -> origin/gh/nikitaved/6/head 2025-12-04T08:56:59.4040427Z * [new branch] gh/nikitaved/6/orig -> origin/gh/nikitaved/6/orig 2025-12-04T08:56:59.4042468Z * [new branch] gh/nikitaved/8/base -> origin/gh/nikitaved/8/base 2025-12-04T08:56:59.4044044Z * [new branch] gh/nikitaved/8/head -> origin/gh/nikitaved/8/head 2025-12-04T08:56:59.4045552Z * [new branch] gh/nikitaved/8/orig -> origin/gh/nikitaved/8/orig 2025-12-04T08:56:59.4047637Z * [new branch] gh/nikitaved/9/base -> origin/gh/nikitaved/9/base 2025-12-04T08:56:59.4049263Z * [new branch] gh/nikitaved/9/head -> origin/gh/nikitaved/9/head 2025-12-04T08:56:59.4050806Z * [new branch] gh/nikitaved/9/orig -> origin/gh/nikitaved/9/orig 2025-12-04T08:56:59.4054105Z * [new branch] gh/oulgen/10/base -> origin/gh/oulgen/10/base 2025-12-04T08:56:59.4055603Z * [new branch] gh/oulgen/10/head -> origin/gh/oulgen/10/head 2025-12-04T08:56:59.4057129Z * [new branch] gh/oulgen/10/orig -> origin/gh/oulgen/10/orig 2025-12-04T08:56:59.4059459Z * [new branch] gh/oulgen/11/base -> origin/gh/oulgen/11/base 2025-12-04T08:56:59.4061056Z * [new branch] gh/oulgen/11/head -> origin/gh/oulgen/11/head 2025-12-04T08:56:59.4063534Z * [new branch] gh/oulgen/11/orig -> origin/gh/oulgen/11/orig 2025-12-04T08:56:59.4065642Z * [new branch] gh/oulgen/12/base -> origin/gh/oulgen/12/base 2025-12-04T08:56:59.4067168Z * [new branch] gh/oulgen/12/head -> origin/gh/oulgen/12/head 2025-12-04T08:56:59.4068678Z * [new branch] gh/oulgen/12/orig -> origin/gh/oulgen/12/orig 2025-12-04T08:56:59.4070747Z * [new branch] gh/oulgen/13/base -> origin/gh/oulgen/13/base 2025-12-04T08:56:59.4072270Z * [new branch] gh/oulgen/13/head -> origin/gh/oulgen/13/head 2025-12-04T08:56:59.4073859Z * [new branch] gh/oulgen/13/orig -> origin/gh/oulgen/13/orig 2025-12-04T08:56:59.4075994Z * [new branch] gh/oulgen/14/base -> origin/gh/oulgen/14/base 2025-12-04T08:56:59.4077573Z * [new branch] gh/oulgen/14/head -> origin/gh/oulgen/14/head 2025-12-04T08:56:59.4079114Z * [new branch] gh/oulgen/14/orig -> origin/gh/oulgen/14/orig 2025-12-04T08:56:59.4081179Z * [new branch] gh/oulgen/15/base -> origin/gh/oulgen/15/base 2025-12-04T08:56:59.4082662Z * [new branch] gh/oulgen/15/head -> origin/gh/oulgen/15/head 2025-12-04T08:56:59.4084173Z * [new branch] gh/oulgen/15/orig -> origin/gh/oulgen/15/orig 2025-12-04T08:56:59.4086163Z * [new branch] gh/oulgen/16/base -> origin/gh/oulgen/16/base 2025-12-04T08:56:59.4087921Z * [new branch] gh/oulgen/16/head -> origin/gh/oulgen/16/head 2025-12-04T08:56:59.4089862Z * [new branch] gh/oulgen/16/orig -> origin/gh/oulgen/16/orig 2025-12-04T08:56:59.4093160Z * [new branch] gh/oulgen/17/base -> origin/gh/oulgen/17/base 2025-12-04T08:56:59.4095304Z * [new branch] gh/oulgen/17/head -> origin/gh/oulgen/17/head 2025-12-04T08:56:59.4097524Z * [new branch] gh/oulgen/17/orig -> origin/gh/oulgen/17/orig 2025-12-04T08:56:59.4100521Z * [new branch] gh/oulgen/18/base -> origin/gh/oulgen/18/base 2025-12-04T08:56:59.4102716Z * [new branch] gh/oulgen/18/head -> origin/gh/oulgen/18/head 2025-12-04T08:56:59.4105325Z * [new branch] gh/oulgen/18/orig -> origin/gh/oulgen/18/orig 2025-12-04T08:56:59.4108432Z * [new branch] gh/oulgen/19/base -> origin/gh/oulgen/19/base 2025-12-04T08:56:59.4110737Z * [new branch] gh/oulgen/19/head -> origin/gh/oulgen/19/head 2025-12-04T08:56:59.4112140Z * [new branch] gh/oulgen/19/orig -> origin/gh/oulgen/19/orig 2025-12-04T08:56:59.4114228Z * [new branch] gh/oulgen/20/base -> origin/gh/oulgen/20/base 2025-12-04T08:56:59.4115756Z * [new branch] gh/oulgen/20/head -> origin/gh/oulgen/20/head 2025-12-04T08:56:59.4117359Z * [new branch] gh/oulgen/20/orig -> origin/gh/oulgen/20/orig 2025-12-04T08:56:59.4119348Z * [new branch] gh/oulgen/21/base -> origin/gh/oulgen/21/base 2025-12-04T08:56:59.4120917Z * [new branch] gh/oulgen/21/head -> origin/gh/oulgen/21/head 2025-12-04T08:56:59.4122455Z * [new branch] gh/oulgen/21/orig -> origin/gh/oulgen/21/orig 2025-12-04T08:56:59.4124582Z * [new branch] gh/oulgen/22/base -> origin/gh/oulgen/22/base 2025-12-04T08:56:59.4126484Z * [new branch] gh/oulgen/22/head -> origin/gh/oulgen/22/head 2025-12-04T08:56:59.4128047Z * [new branch] gh/oulgen/22/orig -> origin/gh/oulgen/22/orig 2025-12-04T08:56:59.4130101Z * [new branch] gh/oulgen/23/base -> origin/gh/oulgen/23/base 2025-12-04T08:56:59.4131616Z * [new branch] gh/oulgen/23/head -> origin/gh/oulgen/23/head 2025-12-04T08:56:59.4133275Z * [new branch] gh/oulgen/23/orig -> origin/gh/oulgen/23/orig 2025-12-04T08:56:59.4135235Z * [new branch] gh/oulgen/24/base -> origin/gh/oulgen/24/base 2025-12-04T08:56:59.4136812Z * [new branch] gh/oulgen/24/head -> origin/gh/oulgen/24/head 2025-12-04T08:56:59.4138326Z * [new branch] gh/oulgen/24/orig -> origin/gh/oulgen/24/orig 2025-12-04T08:56:59.4140409Z * [new branch] gh/oulgen/25/base -> origin/gh/oulgen/25/base 2025-12-04T08:56:59.4141936Z * [new branch] gh/oulgen/25/head -> origin/gh/oulgen/25/head 2025-12-04T08:56:59.4143516Z * [new branch] gh/oulgen/25/orig -> origin/gh/oulgen/25/orig 2025-12-04T08:56:59.4145669Z * [new branch] gh/oulgen/26/base -> origin/gh/oulgen/26/base 2025-12-04T08:56:59.4147228Z * [new branch] gh/oulgen/26/head -> origin/gh/oulgen/26/head 2025-12-04T08:56:59.4148781Z * [new branch] gh/oulgen/26/orig -> origin/gh/oulgen/26/orig 2025-12-04T08:56:59.4150865Z * [new branch] gh/oulgen/4/base -> origin/gh/oulgen/4/base 2025-12-04T08:56:59.4152374Z * [new branch] gh/oulgen/4/head -> origin/gh/oulgen/4/head 2025-12-04T08:56:59.4153902Z * [new branch] gh/oulgen/4/orig -> origin/gh/oulgen/4/orig 2025-12-04T08:56:59.4156415Z * [new branch] gh/oulgen/7/base -> origin/gh/oulgen/7/base 2025-12-04T08:56:59.4158428Z * [new branch] gh/oulgen/7/head -> origin/gh/oulgen/7/head 2025-12-04T08:56:59.4159978Z * [new branch] gh/oulgen/7/orig -> origin/gh/oulgen/7/orig 2025-12-04T08:56:59.4162211Z * [new branch] gh/oulgen/8/base -> origin/gh/oulgen/8/base 2025-12-04T08:56:59.4163990Z * [new branch] gh/oulgen/8/head -> origin/gh/oulgen/8/head 2025-12-04T08:56:59.4165552Z * [new branch] gh/oulgen/8/orig -> origin/gh/oulgen/8/orig 2025-12-04T08:56:59.4167709Z * [new branch] gh/oulgen/9/base -> origin/gh/oulgen/9/base 2025-12-04T08:56:59.4169286Z * [new branch] gh/oulgen/9/head -> origin/gh/oulgen/9/head 2025-12-04T08:56:59.4170846Z * [new branch] gh/oulgen/9/orig -> origin/gh/oulgen/9/orig 2025-12-04T08:56:59.4173494Z * [new branch] gh/patvig/mtia-serialization -> origin/gh/patvig/mtia-serialization 2025-12-04T08:56:59.4175882Z * [new branch] gh/pearu/108/base -> origin/gh/pearu/108/base 2025-12-04T08:56:59.4178147Z * [new branch] gh/pearu/108/head -> origin/gh/pearu/108/head 2025-12-04T08:56:59.4179757Z * [new branch] gh/pearu/108/orig -> origin/gh/pearu/108/orig 2025-12-04T08:56:59.4181775Z * [new branch] gh/pearu/109/base -> origin/gh/pearu/109/base 2025-12-04T08:56:59.4183341Z * [new branch] gh/pearu/109/head -> origin/gh/pearu/109/head 2025-12-04T08:56:59.4184939Z * [new branch] gh/pearu/109/orig -> origin/gh/pearu/109/orig 2025-12-04T08:56:59.4187053Z * [new branch] gh/pearu/110/base -> origin/gh/pearu/110/base 2025-12-04T08:56:59.4188619Z * [new branch] gh/pearu/110/head -> origin/gh/pearu/110/head 2025-12-04T08:56:59.4190144Z * [new branch] gh/pearu/110/orig -> origin/gh/pearu/110/orig 2025-12-04T08:56:59.4192222Z * [new branch] gh/pearu/111/base -> origin/gh/pearu/111/base 2025-12-04T08:56:59.4193720Z * [new branch] gh/pearu/111/head -> origin/gh/pearu/111/head 2025-12-04T08:56:59.4195290Z * [new branch] gh/pearu/111/orig -> origin/gh/pearu/111/orig 2025-12-04T08:56:59.4197399Z * [new branch] gh/pearu/112/base -> origin/gh/pearu/112/base 2025-12-04T08:56:59.4199000Z * [new branch] gh/pearu/112/head -> origin/gh/pearu/112/head 2025-12-04T08:56:59.4200560Z * [new branch] gh/pearu/112/orig -> origin/gh/pearu/112/orig 2025-12-04T08:56:59.4203009Z * [new branch] gh/pearu/115/base -> origin/gh/pearu/115/base 2025-12-04T08:56:59.4204612Z * [new branch] gh/pearu/115/head -> origin/gh/pearu/115/head 2025-12-04T08:56:59.4206199Z * [new branch] gh/pearu/115/orig -> origin/gh/pearu/115/orig 2025-12-04T08:56:59.4208220Z * [new branch] gh/pearu/116/base -> origin/gh/pearu/116/base 2025-12-04T08:56:59.4209762Z * [new branch] gh/pearu/116/head -> origin/gh/pearu/116/head 2025-12-04T08:56:59.4211281Z * [new branch] gh/pearu/116/orig -> origin/gh/pearu/116/orig 2025-12-04T08:56:59.4213495Z * [new branch] gh/pearu/117/base -> origin/gh/pearu/117/base 2025-12-04T08:56:59.4215070Z * [new branch] gh/pearu/117/head -> origin/gh/pearu/117/head 2025-12-04T08:56:59.4216630Z * [new branch] gh/pearu/117/orig -> origin/gh/pearu/117/orig 2025-12-04T08:56:59.4218674Z * [new branch] gh/pearu/118/base -> origin/gh/pearu/118/base 2025-12-04T08:56:59.4220215Z * [new branch] gh/pearu/118/head -> origin/gh/pearu/118/head 2025-12-04T08:56:59.4221996Z * [new branch] gh/pearu/118/orig -> origin/gh/pearu/118/orig 2025-12-04T08:56:59.4223935Z * [new branch] gh/pearu/119/base -> origin/gh/pearu/119/base 2025-12-04T08:56:59.4225643Z * [new branch] gh/pearu/119/head -> origin/gh/pearu/119/head 2025-12-04T08:56:59.4227162Z * [new branch] gh/pearu/119/orig -> origin/gh/pearu/119/orig 2025-12-04T08:56:59.4229249Z * [new branch] gh/pearu/139/base -> origin/gh/pearu/139/base 2025-12-04T08:56:59.4230782Z * [new branch] gh/pearu/139/head -> origin/gh/pearu/139/head 2025-12-04T08:56:59.4232336Z * [new branch] gh/pearu/139/orig -> origin/gh/pearu/139/orig 2025-12-04T08:56:59.4234322Z * [new branch] gh/pearu/140/base -> origin/gh/pearu/140/base 2025-12-04T08:56:59.4235875Z * [new branch] gh/pearu/140/head -> origin/gh/pearu/140/head 2025-12-04T08:56:59.4237586Z * [new branch] gh/pearu/140/orig -> origin/gh/pearu/140/orig 2025-12-04T08:56:59.4239541Z * [new branch] gh/pearu/142/base -> origin/gh/pearu/142/base 2025-12-04T08:56:59.4241024Z * [new branch] gh/pearu/142/head -> origin/gh/pearu/142/head 2025-12-04T08:56:59.4242506Z * [new branch] gh/pearu/142/orig -> origin/gh/pearu/142/orig 2025-12-04T08:56:59.4244568Z * [new branch] gh/pearu/143/base -> origin/gh/pearu/143/base 2025-12-04T08:56:59.4246098Z * [new branch] gh/pearu/143/head -> origin/gh/pearu/143/head 2025-12-04T08:56:59.4247677Z * [new branch] gh/pearu/143/orig -> origin/gh/pearu/143/orig 2025-12-04T08:56:59.4249863Z * [new branch] gh/pearu/147/base -> origin/gh/pearu/147/base 2025-12-04T08:56:59.4251417Z * [new branch] gh/pearu/147/head -> origin/gh/pearu/147/head 2025-12-04T08:56:59.4253116Z * [new branch] gh/pearu/147/orig -> origin/gh/pearu/147/orig 2025-12-04T08:56:59.4255117Z * [new branch] gh/pearu/149/base -> origin/gh/pearu/149/base 2025-12-04T08:56:59.4256610Z * [new branch] gh/pearu/149/head -> origin/gh/pearu/149/head 2025-12-04T08:56:59.4258133Z * [new branch] gh/pearu/149/orig -> origin/gh/pearu/149/orig 2025-12-04T08:56:59.4260699Z * [new branch] gh/pearu/150/base -> origin/gh/pearu/150/base 2025-12-04T08:56:59.4262379Z * [new branch] gh/pearu/150/head -> origin/gh/pearu/150/head 2025-12-04T08:56:59.4263929Z * [new branch] gh/pearu/150/orig -> origin/gh/pearu/150/orig 2025-12-04T08:56:59.4266143Z * [new branch] gh/pearu/151/base -> origin/gh/pearu/151/base 2025-12-04T08:56:59.4267740Z * [new branch] gh/pearu/151/head -> origin/gh/pearu/151/head 2025-12-04T08:56:59.4269285Z * [new branch] gh/pearu/151/orig -> origin/gh/pearu/151/orig 2025-12-04T08:56:59.4271538Z * [new branch] gh/pearu/152/base -> origin/gh/pearu/152/base 2025-12-04T08:56:59.4273273Z * [new branch] gh/pearu/152/head -> origin/gh/pearu/152/head 2025-12-04T08:56:59.4274975Z * [new branch] gh/pearu/152/orig -> origin/gh/pearu/152/orig 2025-12-04T08:56:59.4276999Z * [new branch] gh/pearu/153/base -> origin/gh/pearu/153/base 2025-12-04T08:56:59.4278325Z * [new branch] gh/pearu/153/head -> origin/gh/pearu/153/head 2025-12-04T08:56:59.4279856Z * [new branch] gh/pearu/153/orig -> origin/gh/pearu/153/orig 2025-12-04T08:56:59.4282009Z * [new branch] gh/pearu/154/base -> origin/gh/pearu/154/base 2025-12-04T08:56:59.4283517Z * [new branch] gh/pearu/154/head -> origin/gh/pearu/154/head 2025-12-04T08:56:59.4285081Z * [new branch] gh/pearu/154/orig -> origin/gh/pearu/154/orig 2025-12-04T08:56:59.4287286Z * [new branch] gh/pearu/155/base -> origin/gh/pearu/155/base 2025-12-04T08:56:59.4288882Z * [new branch] gh/pearu/155/head -> origin/gh/pearu/155/head 2025-12-04T08:56:59.4290366Z * [new branch] gh/pearu/155/orig -> origin/gh/pearu/155/orig 2025-12-04T08:56:59.4292641Z * [new branch] gh/pearu/156/base -> origin/gh/pearu/156/base 2025-12-04T08:56:59.4294652Z * [new branch] gh/pearu/156/head -> origin/gh/pearu/156/head 2025-12-04T08:56:59.4296217Z * [new branch] gh/pearu/156/orig -> origin/gh/pearu/156/orig 2025-12-04T08:56:59.4298713Z * [new branch] gh/pearu/56/base -> origin/gh/pearu/56/base 2025-12-04T08:56:59.4300434Z * [new branch] gh/pearu/56/head -> origin/gh/pearu/56/head 2025-12-04T08:56:59.4302195Z * [new branch] gh/pearu/56/orig -> origin/gh/pearu/56/orig 2025-12-04T08:56:59.4304386Z * [new branch] gh/pearu/97/base -> origin/gh/pearu/97/base 2025-12-04T08:56:59.4305902Z * [new branch] gh/pearu/97/head -> origin/gh/pearu/97/head 2025-12-04T08:56:59.4307568Z * [new branch] gh/pearu/97/orig -> origin/gh/pearu/97/orig 2025-12-04T08:56:59.4310063Z * [new branch] gh/pianpwk/21/base -> origin/gh/pianpwk/21/base 2025-12-04T08:56:59.4311620Z * [new branch] gh/pianpwk/21/head -> origin/gh/pianpwk/21/head 2025-12-04T08:56:59.4313881Z * [new branch] gh/pianpwk/28/base -> origin/gh/pianpwk/28/base 2025-12-04T08:56:59.4315487Z * [new branch] gh/pianpwk/28/head -> origin/gh/pianpwk/28/head 2025-12-04T08:56:59.4317479Z * [new branch] gh/pianpwk/28/orig -> origin/gh/pianpwk/28/orig 2025-12-04T08:56:59.4319621Z * [new branch] gh/pianpwk/29/base -> origin/gh/pianpwk/29/base 2025-12-04T08:56:59.4321280Z * [new branch] gh/pianpwk/29/head -> origin/gh/pianpwk/29/head 2025-12-04T08:56:59.4322787Z * [new branch] gh/pianpwk/29/orig -> origin/gh/pianpwk/29/orig 2025-12-04T08:56:59.4325803Z * [new branch] gh/pianpwk/30/base -> origin/gh/pianpwk/30/base 2025-12-04T08:56:59.4327285Z * [new branch] gh/pianpwk/30/head -> origin/gh/pianpwk/30/head 2025-12-04T08:56:59.4328950Z * [new branch] gh/pianpwk/30/orig -> origin/gh/pianpwk/30/orig 2025-12-04T08:56:59.4330963Z * [new branch] gh/pianpwk/31/base -> origin/gh/pianpwk/31/base 2025-12-04T08:56:59.4332711Z * [new branch] gh/pianpwk/31/head -> origin/gh/pianpwk/31/head 2025-12-04T08:56:59.4334145Z * [new branch] gh/pianpwk/31/orig -> origin/gh/pianpwk/31/orig 2025-12-04T08:56:59.4336079Z * [new branch] gh/pianpwk/32/base -> origin/gh/pianpwk/32/base 2025-12-04T08:56:59.4337625Z * [new branch] gh/pianpwk/32/head -> origin/gh/pianpwk/32/head 2025-12-04T08:56:59.4339166Z * [new branch] gh/pianpwk/32/orig -> origin/gh/pianpwk/32/orig 2025-12-04T08:56:59.4341099Z * [new branch] gh/pianpwk/33/base -> origin/gh/pianpwk/33/base 2025-12-04T08:56:59.4342626Z * [new branch] gh/pianpwk/33/head -> origin/gh/pianpwk/33/head 2025-12-04T08:56:59.4344149Z * [new branch] gh/pianpwk/33/orig -> origin/gh/pianpwk/33/orig 2025-12-04T08:56:59.4346441Z * [new branch] gh/pianpwk/34/base -> origin/gh/pianpwk/34/base 2025-12-04T08:56:59.4348173Z * [new branch] gh/pianpwk/34/head -> origin/gh/pianpwk/34/head 2025-12-04T08:56:59.4349836Z * [new branch] gh/pianpwk/34/orig -> origin/gh/pianpwk/34/orig 2025-12-04T08:56:59.4351858Z * [new branch] gh/pianpwk/35/base -> origin/gh/pianpwk/35/base 2025-12-04T08:56:59.4353515Z * [new branch] gh/pianpwk/35/head -> origin/gh/pianpwk/35/head 2025-12-04T08:56:59.4355109Z * [new branch] gh/pianpwk/35/orig -> origin/gh/pianpwk/35/orig 2025-12-04T08:56:59.4357643Z * [new branch] gh/rec/141/base -> origin/gh/rec/141/base 2025-12-04T08:56:59.4359229Z * [new branch] gh/rec/141/head -> origin/gh/rec/141/head 2025-12-04T08:56:59.4361511Z * [new branch] gh/rec/153/base -> origin/gh/rec/153/base 2025-12-04T08:56:59.4363283Z * [new branch] gh/rec/153/head -> origin/gh/rec/153/head 2025-12-04T08:56:59.4364869Z * [new branch] gh/rec/153/orig -> origin/gh/rec/153/orig 2025-12-04T08:56:59.4366917Z * [new branch] gh/rec/154/base -> origin/gh/rec/154/base 2025-12-04T08:56:59.4368651Z * [new branch] gh/rec/154/head -> origin/gh/rec/154/head 2025-12-04T08:56:59.4370052Z * [new branch] gh/rec/154/orig -> origin/gh/rec/154/orig 2025-12-04T08:56:59.4372146Z * [new branch] gh/rec/164/base -> origin/gh/rec/164/base 2025-12-04T08:56:59.4374228Z * [new branch] gh/rec/164/head -> origin/gh/rec/164/head 2025-12-04T08:56:59.4375771Z * [new branch] gh/rec/164/orig -> origin/gh/rec/164/orig 2025-12-04T08:56:59.4377823Z * [new branch] gh/rec/166/base -> origin/gh/rec/166/base 2025-12-04T08:56:59.4379334Z * [new branch] gh/rec/166/head -> origin/gh/rec/166/head 2025-12-04T08:56:59.4380884Z * [new branch] gh/rec/166/orig -> origin/gh/rec/166/orig 2025-12-04T08:56:59.4383120Z * [new branch] gh/rec/167/base -> origin/gh/rec/167/base 2025-12-04T08:56:59.4384595Z * [new branch] gh/rec/167/head -> origin/gh/rec/167/head 2025-12-04T08:56:59.4386124Z * [new branch] gh/rec/167/orig -> origin/gh/rec/167/orig 2025-12-04T08:56:59.4388242Z * [new branch] gh/rec/168/base -> origin/gh/rec/168/base 2025-12-04T08:56:59.4389886Z * [new branch] gh/rec/168/head -> origin/gh/rec/168/head 2025-12-04T08:56:59.4391405Z * [new branch] gh/rec/168/orig -> origin/gh/rec/168/orig 2025-12-04T08:56:59.4393391Z * [new branch] gh/rec/169/base -> origin/gh/rec/169/base 2025-12-04T08:56:59.4394927Z * [new branch] gh/rec/169/head -> origin/gh/rec/169/head 2025-12-04T08:56:59.4396474Z * [new branch] gh/rec/169/orig -> origin/gh/rec/169/orig 2025-12-04T08:56:59.4398614Z * [new branch] gh/rec/170/base -> origin/gh/rec/170/base 2025-12-04T08:56:59.4400187Z * [new branch] gh/rec/170/head -> origin/gh/rec/170/head 2025-12-04T08:56:59.4401727Z * [new branch] gh/rec/170/orig -> origin/gh/rec/170/orig 2025-12-04T08:56:59.4403758Z * [new branch] gh/rec/171/base -> origin/gh/rec/171/base 2025-12-04T08:56:59.4405310Z * [new branch] gh/rec/171/head -> origin/gh/rec/171/head 2025-12-04T08:56:59.4406909Z * [new branch] gh/rec/171/orig -> origin/gh/rec/171/orig 2025-12-04T08:56:59.4408932Z * [new branch] gh/rec/172/base -> origin/gh/rec/172/base 2025-12-04T08:56:59.4410447Z * [new branch] gh/rec/172/head -> origin/gh/rec/172/head 2025-12-04T08:56:59.4411936Z * [new branch] gh/rec/172/orig -> origin/gh/rec/172/orig 2025-12-04T08:56:59.4414205Z * [new branch] gh/rec/173/base -> origin/gh/rec/173/base 2025-12-04T08:56:59.4415668Z * [new branch] gh/rec/173/head -> origin/gh/rec/173/head 2025-12-04T08:56:59.4417327Z * [new branch] gh/rec/173/orig -> origin/gh/rec/173/orig 2025-12-04T08:56:59.4419968Z * [new branch] gh/rec/174/base -> origin/gh/rec/174/base 2025-12-04T08:56:59.4421545Z * [new branch] gh/rec/174/head -> origin/gh/rec/174/head 2025-12-04T08:56:59.4423103Z * [new branch] gh/rec/174/orig -> origin/gh/rec/174/orig 2025-12-04T08:56:59.4425185Z * [new branch] gh/rec/175/base -> origin/gh/rec/175/base 2025-12-04T08:56:59.4426692Z * [new branch] gh/rec/175/head -> origin/gh/rec/175/head 2025-12-04T08:56:59.4428296Z * [new branch] gh/rec/175/orig -> origin/gh/rec/175/orig 2025-12-04T08:56:59.4430313Z * [new branch] gh/rec/176/base -> origin/gh/rec/176/base 2025-12-04T08:56:59.4431846Z * [new branch] gh/rec/176/head -> origin/gh/rec/176/head 2025-12-04T08:56:59.4433462Z * [new branch] gh/rec/176/orig -> origin/gh/rec/176/orig 2025-12-04T08:56:59.4435411Z * [new branch] gh/rec/177/base -> origin/gh/rec/177/base 2025-12-04T08:56:59.4436984Z * [new branch] gh/rec/177/head -> origin/gh/rec/177/head 2025-12-04T08:56:59.4438502Z * [new branch] gh/rec/177/orig -> origin/gh/rec/177/orig 2025-12-04T08:56:59.4441124Z * [new branch] gh/robert-hardwick/3/base -> origin/gh/robert-hardwick/3/base 2025-12-04T08:56:59.4442739Z * [new branch] gh/robert-hardwick/3/head -> origin/gh/robert-hardwick/3/head 2025-12-04T08:56:59.4444264Z * [new branch] gh/robert-hardwick/3/orig -> origin/gh/robert-hardwick/3/orig 2025-12-04T08:56:59.4446797Z * [new branch] gh/robert-hardwick/4/base -> origin/gh/robert-hardwick/4/base 2025-12-04T08:56:59.4448372Z * [new branch] gh/robert-hardwick/4/head -> origin/gh/robert-hardwick/4/head 2025-12-04T08:56:59.4449903Z * [new branch] gh/robert-hardwick/4/orig -> origin/gh/robert-hardwick/4/orig 2025-12-04T08:56:59.4451975Z * [new branch] gh/robert-hardwick/5/base -> origin/gh/robert-hardwick/5/base 2025-12-04T08:56:59.4453698Z * [new branch] gh/robert-hardwick/5/head -> origin/gh/robert-hardwick/5/head 2025-12-04T08:56:59.4455230Z * [new branch] gh/robert-hardwick/5/orig -> origin/gh/robert-hardwick/5/orig 2025-12-04T08:56:59.4457318Z * [new branch] gh/robert-hardwick/6/base -> origin/gh/robert-hardwick/6/base 2025-12-04T08:56:59.4458935Z * [new branch] gh/robert-hardwick/6/head -> origin/gh/robert-hardwick/6/head 2025-12-04T08:56:59.4460409Z * [new branch] gh/robert-hardwick/6/orig -> origin/gh/robert-hardwick/6/orig 2025-12-04T08:56:59.4464204Z * [new branch] gh/robert-hardwick/7/base -> origin/gh/robert-hardwick/7/base 2025-12-04T08:56:59.4465840Z * [new branch] gh/robert-hardwick/7/head -> origin/gh/robert-hardwick/7/head 2025-12-04T08:56:59.4467442Z * [new branch] gh/robert-hardwick/7/orig -> origin/gh/robert-hardwick/7/orig 2025-12-04T08:56:59.4469530Z * [new branch] gh/robert-hardwick/8/base -> origin/gh/robert-hardwick/8/base 2025-12-04T08:56:59.4471035Z * [new branch] gh/robert-hardwick/8/head -> origin/gh/robert-hardwick/8/head 2025-12-04T08:56:59.4472556Z * [new branch] gh/robert-hardwick/8/orig -> origin/gh/robert-hardwick/8/orig 2025-12-04T08:56:59.4474663Z * [new branch] gh/robert-hardwick/9/base -> origin/gh/robert-hardwick/9/base 2025-12-04T08:56:59.4476165Z * [new branch] gh/robert-hardwick/9/head -> origin/gh/robert-hardwick/9/head 2025-12-04T08:56:59.4477664Z * [new branch] gh/robert-hardwick/9/orig -> origin/gh/robert-hardwick/9/orig 2025-12-04T08:56:59.4480165Z * [new branch] gh/rtimpe/1/base -> origin/gh/rtimpe/1/base 2025-12-04T08:56:59.4481758Z * [new branch] gh/rtimpe/1/head -> origin/gh/rtimpe/1/head 2025-12-04T08:56:59.4483887Z * [new branch] gh/rtimpe/2/base -> origin/gh/rtimpe/2/base 2025-12-04T08:56:59.4485373Z * [new branch] gh/rtimpe/2/head -> origin/gh/rtimpe/2/head 2025-12-04T08:56:59.4487912Z * [new branch] gh/rtimpe/22/base -> origin/gh/rtimpe/22/base 2025-12-04T08:56:59.4489532Z * [new branch] gh/rtimpe/22/head -> origin/gh/rtimpe/22/head 2025-12-04T08:56:59.4491036Z * [new branch] gh/rtimpe/22/orig -> origin/gh/rtimpe/22/orig 2025-12-04T08:56:59.4493222Z * [new branch] gh/rtimpe/23/base -> origin/gh/rtimpe/23/base 2025-12-04T08:56:59.4494736Z * [new branch] gh/rtimpe/23/head -> origin/gh/rtimpe/23/head 2025-12-04T08:56:59.4496412Z * [new branch] gh/rtimpe/23/orig -> origin/gh/rtimpe/23/orig 2025-12-04T08:56:59.4498359Z * [new branch] gh/rtimpe/24/base -> origin/gh/rtimpe/24/base 2025-12-04T08:56:59.4499837Z * [new branch] gh/rtimpe/24/head -> origin/gh/rtimpe/24/head 2025-12-04T08:56:59.4501332Z * [new branch] gh/rtimpe/24/orig -> origin/gh/rtimpe/24/orig 2025-12-04T08:56:59.4503502Z * [new branch] gh/rtimpe/25/base -> origin/gh/rtimpe/25/base 2025-12-04T08:56:59.4505006Z * [new branch] gh/rtimpe/25/head -> origin/gh/rtimpe/25/head 2025-12-04T08:56:59.4506591Z * [new branch] gh/rtimpe/25/orig -> origin/gh/rtimpe/25/orig 2025-12-04T08:56:59.4508629Z * [new branch] gh/rtimpe/26/base -> origin/gh/rtimpe/26/base 2025-12-04T08:56:59.4510225Z * [new branch] gh/rtimpe/26/head -> origin/gh/rtimpe/26/head 2025-12-04T08:56:59.4511731Z * [new branch] gh/rtimpe/26/orig -> origin/gh/rtimpe/26/orig 2025-12-04T08:56:59.4513714Z * [new branch] gh/rtimpe/27/base -> origin/gh/rtimpe/27/base 2025-12-04T08:56:59.4515223Z * [new branch] gh/rtimpe/27/head -> origin/gh/rtimpe/27/head 2025-12-04T08:56:59.4516797Z * [new branch] gh/rtimpe/27/orig -> origin/gh/rtimpe/27/orig 2025-12-04T08:56:59.4518815Z * [new branch] gh/rtimpe/28/base -> origin/gh/rtimpe/28/base 2025-12-04T08:56:59.4520323Z * [new branch] gh/rtimpe/28/head -> origin/gh/rtimpe/28/head 2025-12-04T08:56:59.4521912Z * [new branch] gh/rtimpe/28/orig -> origin/gh/rtimpe/28/orig 2025-12-04T08:56:59.4524132Z * [new branch] gh/rtimpe/29/base -> origin/gh/rtimpe/29/base 2025-12-04T08:56:59.4525658Z * [new branch] gh/rtimpe/29/head -> origin/gh/rtimpe/29/head 2025-12-04T08:56:59.4527248Z * [new branch] gh/rtimpe/29/orig -> origin/gh/rtimpe/29/orig 2025-12-04T08:56:59.4529239Z * [new branch] gh/rtimpe/3/base -> origin/gh/rtimpe/3/base 2025-12-04T08:56:59.4530709Z * [new branch] gh/rtimpe/3/head -> origin/gh/rtimpe/3/head 2025-12-04T08:56:59.4532850Z * [new branch] gh/rtimpe/30/base -> origin/gh/rtimpe/30/base 2025-12-04T08:56:59.4534392Z * [new branch] gh/rtimpe/30/head -> origin/gh/rtimpe/30/head 2025-12-04T08:56:59.4535919Z * [new branch] gh/rtimpe/30/orig -> origin/gh/rtimpe/30/orig 2025-12-04T08:56:59.4537970Z * [new branch] gh/rtimpe/31/base -> origin/gh/rtimpe/31/base 2025-12-04T08:56:59.4539478Z * [new branch] gh/rtimpe/31/head -> origin/gh/rtimpe/31/head 2025-12-04T08:56:59.4541091Z * [new branch] gh/rtimpe/31/orig -> origin/gh/rtimpe/31/orig 2025-12-04T08:56:59.4543275Z * [new branch] gh/rtimpe/32/base -> origin/gh/rtimpe/32/base 2025-12-04T08:56:59.4544964Z * [new branch] gh/rtimpe/32/head -> origin/gh/rtimpe/32/head 2025-12-04T08:56:59.4546502Z * [new branch] gh/rtimpe/32/orig -> origin/gh/rtimpe/32/orig 2025-12-04T08:56:59.4549041Z * [new branch] gh/rtimpe/33/base -> origin/gh/rtimpe/33/base 2025-12-04T08:56:59.4550594Z * [new branch] gh/rtimpe/33/head -> origin/gh/rtimpe/33/head 2025-12-04T08:56:59.4552151Z * [new branch] gh/rtimpe/33/orig -> origin/gh/rtimpe/33/orig 2025-12-04T08:56:59.4554497Z * [new branch] gh/rtimpe/34/base -> origin/gh/rtimpe/34/base 2025-12-04T08:56:59.4556043Z * [new branch] gh/rtimpe/34/head -> origin/gh/rtimpe/34/head 2025-12-04T08:56:59.4557606Z * [new branch] gh/rtimpe/34/orig -> origin/gh/rtimpe/34/orig 2025-12-04T08:56:59.4559759Z * [new branch] gh/rtimpe/35/base -> origin/gh/rtimpe/35/base 2025-12-04T08:56:59.4561261Z * [new branch] gh/rtimpe/35/head -> origin/gh/rtimpe/35/head 2025-12-04T08:56:59.4563045Z * [new branch] gh/rtimpe/35/orig -> origin/gh/rtimpe/35/orig 2025-12-04T08:56:59.4565136Z * [new branch] gh/rtimpe/4/base -> origin/gh/rtimpe/4/base 2025-12-04T08:56:59.4566668Z * [new branch] gh/rtimpe/4/head -> origin/gh/rtimpe/4/head 2025-12-04T08:56:59.4569275Z * [new branch] gh/ruisizhang123/1/base -> origin/gh/ruisizhang123/1/base 2025-12-04T08:56:59.4570839Z * [new branch] gh/ruisizhang123/1/head -> origin/gh/ruisizhang123/1/head 2025-12-04T08:56:59.4572674Z * [new branch] gh/ruisizhang123/1/orig -> origin/gh/ruisizhang123/1/orig 2025-12-04T08:56:59.4574797Z * [new branch] gh/ruisizhang123/4/base -> origin/gh/ruisizhang123/4/base 2025-12-04T08:56:59.4576344Z * [new branch] gh/ruisizhang123/4/head -> origin/gh/ruisizhang123/4/head 2025-12-04T08:56:59.4577893Z * [new branch] gh/ruisizhang123/4/orig -> origin/gh/ruisizhang123/4/orig 2025-12-04T08:56:59.4579999Z * [new branch] gh/ruisizhang123/5/base -> origin/gh/ruisizhang123/5/base 2025-12-04T08:56:59.4581600Z * [new branch] gh/ruisizhang123/5/head -> origin/gh/ruisizhang123/5/head 2025-12-04T08:56:59.4583083Z * [new branch] gh/ruisizhang123/5/orig -> origin/gh/ruisizhang123/5/orig 2025-12-04T08:56:59.4585114Z * [new branch] gh/ruisizhang123/6/base -> origin/gh/ruisizhang123/6/base 2025-12-04T08:56:59.4586764Z * [new branch] gh/ruisizhang123/6/head -> origin/gh/ruisizhang123/6/head 2025-12-04T08:56:59.4588263Z * [new branch] gh/ruisizhang123/6/orig -> origin/gh/ruisizhang123/6/orig 2025-12-04T08:56:59.4590331Z * [new branch] gh/ruisizhang123/7/base -> origin/gh/ruisizhang123/7/base 2025-12-04T08:56:59.4591920Z * [new branch] gh/ruisizhang123/7/head -> origin/gh/ruisizhang123/7/head 2025-12-04T08:56:59.4593499Z * [new branch] gh/ruisizhang123/7/orig -> origin/gh/ruisizhang123/7/orig 2025-12-04T08:56:59.4595479Z * [new branch] gh/ruisizhang123/8/base -> origin/gh/ruisizhang123/8/base 2025-12-04T08:56:59.4597046Z * [new branch] gh/ruisizhang123/8/head -> origin/gh/ruisizhang123/8/head 2025-12-04T08:56:59.4598563Z * [new branch] gh/ruisizhang123/8/orig -> origin/gh/ruisizhang123/8/orig 2025-12-04T08:56:59.4600623Z * [new branch] gh/ruisizhang123/9/base -> origin/gh/ruisizhang123/9/base 2025-12-04T08:56:59.4602157Z * [new branch] gh/ruisizhang123/9/head -> origin/gh/ruisizhang123/9/head 2025-12-04T08:56:59.4603649Z * [new branch] gh/ruisizhang123/9/orig -> origin/gh/ruisizhang123/9/orig 2025-12-04T08:56:59.4606295Z * [new branch] gh/seemethere/52/base -> origin/gh/seemethere/52/base 2025-12-04T08:56:59.4607856Z * [new branch] gh/seemethere/52/head -> origin/gh/seemethere/52/head 2025-12-04T08:56:59.4609453Z * [new branch] gh/seemethere/52/orig -> origin/gh/seemethere/52/orig 2025-12-04T08:56:59.4611919Z * [new branch] gh/seemethere/53/base -> origin/gh/seemethere/53/base 2025-12-04T08:56:59.4613731Z * [new branch] gh/seemethere/53/head -> origin/gh/seemethere/53/head 2025-12-04T08:56:59.4615242Z * [new branch] gh/seemethere/53/orig -> origin/gh/seemethere/53/orig 2025-12-04T08:56:59.4617307Z * [new branch] gh/seemethere/54/base -> origin/gh/seemethere/54/base 2025-12-04T08:56:59.4618853Z * [new branch] gh/seemethere/54/head -> origin/gh/seemethere/54/head 2025-12-04T08:56:59.4620437Z * [new branch] gh/seemethere/54/orig -> origin/gh/seemethere/54/orig 2025-12-04T08:56:59.4622573Z * [new branch] gh/seemethere/55/base -> origin/gh/seemethere/55/base 2025-12-04T08:56:59.4624005Z * [new branch] gh/seemethere/55/head -> origin/gh/seemethere/55/head 2025-12-04T08:56:59.4625529Z * [new branch] gh/seemethere/55/orig -> origin/gh/seemethere/55/orig 2025-12-04T08:56:59.4627593Z * [new branch] gh/seemethere/59/base -> origin/gh/seemethere/59/base 2025-12-04T08:56:59.4629185Z * [new branch] gh/seemethere/59/head -> origin/gh/seemethere/59/head 2025-12-04T08:56:59.4630718Z * [new branch] gh/seemethere/59/orig -> origin/gh/seemethere/59/orig 2025-12-04T08:56:59.4632808Z * [new branch] gh/seemethere/62/base -> origin/gh/seemethere/62/base 2025-12-04T08:56:59.4634793Z * [new branch] gh/seemethere/62/head -> origin/gh/seemethere/62/head 2025-12-04T08:56:59.4636367Z * [new branch] gh/seemethere/62/orig -> origin/gh/seemethere/62/orig 2025-12-04T08:56:59.4638358Z * [new branch] gh/seemethere/63/base -> origin/gh/seemethere/63/base 2025-12-04T08:56:59.4639922Z * [new branch] gh/seemethere/63/head -> origin/gh/seemethere/63/head 2025-12-04T08:56:59.4641484Z * [new branch] gh/seemethere/63/orig -> origin/gh/seemethere/63/orig 2025-12-04T08:56:59.4643422Z * [new branch] gh/seemethere/71/base -> origin/gh/seemethere/71/base 2025-12-04T08:56:59.4644944Z * [new branch] gh/seemethere/71/head -> origin/gh/seemethere/71/head 2025-12-04T08:56:59.4646489Z * [new branch] gh/seemethere/71/orig -> origin/gh/seemethere/71/orig 2025-12-04T08:56:59.4648689Z * [new branch] gh/seemethere/72/base -> origin/gh/seemethere/72/base 2025-12-04T08:56:59.4650234Z * [new branch] gh/seemethere/72/head -> origin/gh/seemethere/72/head 2025-12-04T08:56:59.4651827Z * [new branch] gh/seemethere/72/orig -> origin/gh/seemethere/72/orig 2025-12-04T08:56:59.4653979Z * [new branch] gh/seemethere/73/base -> origin/gh/seemethere/73/base 2025-12-04T08:56:59.4655539Z * [new branch] gh/seemethere/73/head -> origin/gh/seemethere/73/head 2025-12-04T08:56:59.4657095Z * [new branch] gh/seemethere/73/orig -> origin/gh/seemethere/73/orig 2025-12-04T08:56:59.4659112Z * [new branch] gh/seemethere/74/base -> origin/gh/seemethere/74/base 2025-12-04T08:56:59.4660652Z * [new branch] gh/seemethere/74/head -> origin/gh/seemethere/74/head 2025-12-04T08:56:59.4662228Z * [new branch] gh/seemethere/74/orig -> origin/gh/seemethere/74/orig 2025-12-04T08:56:59.4664989Z * [new branch] gh/seemethere/75/base -> origin/gh/seemethere/75/base 2025-12-04T08:56:59.4666522Z * [new branch] gh/seemethere/75/head -> origin/gh/seemethere/75/head 2025-12-04T08:56:59.4668123Z * [new branch] gh/seemethere/75/orig -> origin/gh/seemethere/75/orig 2025-12-04T08:56:59.4670271Z * [new branch] gh/seemethere/76/base -> origin/gh/seemethere/76/base 2025-12-04T08:56:59.4671804Z * [new branch] gh/seemethere/76/head -> origin/gh/seemethere/76/head 2025-12-04T08:56:59.4673391Z * [new branch] gh/seemethere/76/orig -> origin/gh/seemethere/76/orig 2025-12-04T08:56:59.4676550Z * [new branch] gh/shunting314/145/base -> origin/gh/shunting314/145/base 2025-12-04T08:56:59.4678313Z * [new branch] gh/shunting314/145/head -> origin/gh/shunting314/145/head 2025-12-04T08:56:59.4679938Z * [new branch] gh/shunting314/145/orig -> origin/gh/shunting314/145/orig 2025-12-04T08:56:59.4682170Z * [new branch] gh/shunting314/176/base -> origin/gh/shunting314/176/base 2025-12-04T08:56:59.4684020Z * [new branch] gh/shunting314/176/head -> origin/gh/shunting314/176/head 2025-12-04T08:56:59.4685482Z * [new branch] gh/shunting314/176/orig -> origin/gh/shunting314/176/orig 2025-12-04T08:56:59.4687575Z * [new branch] gh/shunting314/249/base -> origin/gh/shunting314/249/base 2025-12-04T08:56:59.4689197Z * [new branch] gh/shunting314/249/head -> origin/gh/shunting314/249/head 2025-12-04T08:56:59.4690807Z * [new branch] gh/shunting314/249/orig -> origin/gh/shunting314/249/orig 2025-12-04T08:56:59.4693254Z * [new branch] gh/shunting314/253/base -> origin/gh/shunting314/253/base 2025-12-04T08:56:59.4694713Z * [new branch] gh/shunting314/253/head -> origin/gh/shunting314/253/head 2025-12-04T08:56:59.4696244Z * [new branch] gh/shunting314/253/orig -> origin/gh/shunting314/253/orig 2025-12-04T08:56:59.4698307Z * [new branch] gh/shunting314/256/base -> origin/gh/shunting314/256/base 2025-12-04T08:56:59.4699915Z * [new branch] gh/shunting314/256/head -> origin/gh/shunting314/256/head 2025-12-04T08:56:59.4701423Z * [new branch] gh/shunting314/256/orig -> origin/gh/shunting314/256/orig 2025-12-04T08:56:59.4703699Z * [new branch] gh/shunting314/257/base -> origin/gh/shunting314/257/base 2025-12-04T08:56:59.4705269Z * [new branch] gh/shunting314/257/head -> origin/gh/shunting314/257/head 2025-12-04T08:56:59.4706805Z * [new branch] gh/shunting314/257/orig -> origin/gh/shunting314/257/orig 2025-12-04T08:56:59.4708989Z * [new branch] gh/shunting314/258/base -> origin/gh/shunting314/258/base 2025-12-04T08:56:59.4710494Z * [new branch] gh/shunting314/258/head -> origin/gh/shunting314/258/head 2025-12-04T08:56:59.4712099Z * [new branch] gh/shunting314/258/orig -> origin/gh/shunting314/258/orig 2025-12-04T08:56:59.4714098Z * [new branch] gh/shunting314/259/base -> origin/gh/shunting314/259/base 2025-12-04T08:56:59.4715667Z * [new branch] gh/shunting314/259/head -> origin/gh/shunting314/259/head 2025-12-04T08:56:59.4717213Z * [new branch] gh/shunting314/259/orig -> origin/gh/shunting314/259/orig 2025-12-04T08:56:59.4719358Z * [new branch] gh/shunting314/260/base -> origin/gh/shunting314/260/base 2025-12-04T08:56:59.4720909Z * [new branch] gh/shunting314/260/head -> origin/gh/shunting314/260/head 2025-12-04T08:56:59.4722631Z * [new branch] gh/shunting314/260/orig -> origin/gh/shunting314/260/orig 2025-12-04T08:56:59.4724849Z * [new branch] gh/shunting314/261/base -> origin/gh/shunting314/261/base 2025-12-04T08:56:59.4726493Z * [new branch] gh/shunting314/261/head -> origin/gh/shunting314/261/head 2025-12-04T08:56:59.4728180Z * [new branch] gh/shunting314/261/orig -> origin/gh/shunting314/261/orig 2025-12-04T08:56:59.4730318Z * [new branch] gh/shunting314/262/base -> origin/gh/shunting314/262/base 2025-12-04T08:56:59.4731947Z * [new branch] gh/shunting314/262/head -> origin/gh/shunting314/262/head 2025-12-04T08:56:59.4733625Z * [new branch] gh/shunting314/262/orig -> origin/gh/shunting314/262/orig 2025-12-04T08:56:59.4735811Z * [new branch] gh/shunting314/263/base -> origin/gh/shunting314/263/base 2025-12-04T08:56:59.4737488Z * [new branch] gh/shunting314/263/head -> origin/gh/shunting314/263/head 2025-12-04T08:56:59.4739037Z * [new branch] gh/shunting314/263/orig -> origin/gh/shunting314/263/orig 2025-12-04T08:56:59.4741609Z * [new branch] gh/shunting314/264/base -> origin/gh/shunting314/264/base 2025-12-04T08:56:59.4743184Z * [new branch] gh/shunting314/264/head -> origin/gh/shunting314/264/head 2025-12-04T08:56:59.4744838Z * [new branch] gh/shunting314/264/orig -> origin/gh/shunting314/264/orig 2025-12-04T08:56:59.4746836Z * [new branch] gh/shunting314/265/base -> origin/gh/shunting314/265/base 2025-12-04T08:56:59.4748299Z * [new branch] gh/shunting314/265/head -> origin/gh/shunting314/265/head 2025-12-04T08:56:59.4749837Z * [new branch] gh/shunting314/265/orig -> origin/gh/shunting314/265/orig 2025-12-04T08:56:59.4751877Z * [new branch] gh/shunting314/266/base -> origin/gh/shunting314/266/base 2025-12-04T08:56:59.4753610Z * [new branch] gh/shunting314/266/head -> origin/gh/shunting314/266/head 2025-12-04T08:56:59.4755165Z * [new branch] gh/shunting314/266/orig -> origin/gh/shunting314/266/orig 2025-12-04T08:56:59.4757557Z * [new branch] gh/shunting314/267/base -> origin/gh/shunting314/267/base 2025-12-04T08:56:59.4759211Z * [new branch] gh/shunting314/267/head -> origin/gh/shunting314/267/head 2025-12-04T08:56:59.4761169Z * [new branch] gh/shunting314/267/orig -> origin/gh/shunting314/267/orig 2025-12-04T08:56:59.4764485Z * [new branch] gh/shunting314/268/base -> origin/gh/shunting314/268/base 2025-12-04T08:56:59.4766099Z * [new branch] gh/shunting314/268/head -> origin/gh/shunting314/268/head 2025-12-04T08:56:59.4767660Z * [new branch] gh/shunting314/268/orig -> origin/gh/shunting314/268/orig 2025-12-04T08:56:59.4769887Z * [new branch] gh/shunting314/269/base -> origin/gh/shunting314/269/base 2025-12-04T08:56:59.4771419Z * [new branch] gh/shunting314/269/head -> origin/gh/shunting314/269/head 2025-12-04T08:56:59.4773085Z * [new branch] gh/shunting314/269/orig -> origin/gh/shunting314/269/orig 2025-12-04T08:56:59.4775524Z * [new branch] gh/silverguo/1/base -> origin/gh/silverguo/1/base 2025-12-04T08:56:59.4777091Z * [new branch] gh/silverguo/1/head -> origin/gh/silverguo/1/head 2025-12-04T08:56:59.4779086Z * [new branch] gh/silverguo/2/base -> origin/gh/silverguo/2/base 2025-12-04T08:56:59.4780556Z * [new branch] gh/silverguo/2/head -> origin/gh/silverguo/2/head 2025-12-04T08:56:59.4783372Z * [new branch] gh/silverguo/3/base -> origin/gh/silverguo/3/base 2025-12-04T08:56:59.4784926Z * [new branch] gh/silverguo/3/head -> origin/gh/silverguo/3/head 2025-12-04T08:56:59.4786798Z * [new branch] gh/silverguo/4/base -> origin/gh/silverguo/4/base 2025-12-04T08:56:59.4788328Z * [new branch] gh/silverguo/4/head -> origin/gh/silverguo/4/head 2025-12-04T08:56:59.4790795Z * [new branch] gh/slayton58/39/base -> origin/gh/slayton58/39/base 2025-12-04T08:56:59.4792410Z * [new branch] gh/slayton58/39/head -> origin/gh/slayton58/39/head 2025-12-04T08:56:59.4794106Z * [new branch] gh/slayton58/39/orig -> origin/gh/slayton58/39/orig 2025-12-04T08:56:59.4796138Z * [new branch] gh/slayton58/42/base -> origin/gh/slayton58/42/base 2025-12-04T08:56:59.4797759Z * [new branch] gh/slayton58/42/head -> origin/gh/slayton58/42/head 2025-12-04T08:56:59.4799394Z * [new branch] gh/slayton58/42/orig -> origin/gh/slayton58/42/orig 2025-12-04T08:56:59.4801387Z * [new branch] gh/slayton58/43/base -> origin/gh/slayton58/43/base 2025-12-04T08:56:59.4802939Z * [new branch] gh/slayton58/43/head -> origin/gh/slayton58/43/head 2025-12-04T08:56:59.4804495Z * [new branch] gh/slayton58/43/orig -> origin/gh/slayton58/43/orig 2025-12-04T08:56:59.4806719Z * [new branch] gh/slayton58/44/base -> origin/gh/slayton58/44/base 2025-12-04T08:56:59.4808258Z * [new branch] gh/slayton58/44/head -> origin/gh/slayton58/44/head 2025-12-04T08:56:59.4809964Z * [new branch] gh/slayton58/44/orig -> origin/gh/slayton58/44/orig 2025-12-04T08:56:59.4811900Z * [new branch] gh/slayton58/45/base -> origin/gh/slayton58/45/base 2025-12-04T08:56:59.4813562Z * [new branch] gh/slayton58/45/head -> origin/gh/slayton58/45/head 2025-12-04T08:56:59.4815180Z * [new branch] gh/slayton58/45/orig -> origin/gh/slayton58/45/orig 2025-12-04T08:56:59.4817249Z * [new branch] gh/slayton58/46/base -> origin/gh/slayton58/46/base 2025-12-04T08:56:59.4818961Z * [new branch] gh/slayton58/46/head -> origin/gh/slayton58/46/head 2025-12-04T08:56:59.4820518Z * [new branch] gh/slayton58/46/orig -> origin/gh/slayton58/46/orig 2025-12-04T08:56:59.4822581Z * [new branch] gh/slayton58/6/base -> origin/gh/slayton58/6/base 2025-12-04T08:56:59.4824133Z * [new branch] gh/slayton58/6/head -> origin/gh/slayton58/6/head 2025-12-04T08:56:59.4826127Z * [new branch] gh/slayton58/7/base -> origin/gh/slayton58/7/base 2025-12-04T08:56:59.4827627Z * [new branch] gh/slayton58/7/head -> origin/gh/slayton58/7/head 2025-12-04T08:56:59.4830328Z * [new branch] gh/soulitzer/269/base -> origin/gh/soulitzer/269/base 2025-12-04T08:56:59.4832234Z * [new branch] gh/soulitzer/269/head -> origin/gh/soulitzer/269/head 2025-12-04T08:56:59.4833882Z * [new branch] gh/soulitzer/269/orig -> origin/gh/soulitzer/269/orig 2025-12-04T08:56:59.4836034Z * [new branch] gh/soulitzer/276/base -> origin/gh/soulitzer/276/base 2025-12-04T08:56:59.4837609Z * [new branch] gh/soulitzer/276/head -> origin/gh/soulitzer/276/head 2025-12-04T08:56:59.4839232Z * [new branch] gh/soulitzer/276/orig -> origin/gh/soulitzer/276/orig 2025-12-04T08:56:59.4841622Z * [new branch] gh/soulitzer/287/base -> origin/gh/soulitzer/287/base 2025-12-04T08:56:59.4843265Z * [new branch] gh/soulitzer/287/head -> origin/gh/soulitzer/287/head 2025-12-04T08:56:59.4844822Z * [new branch] gh/soulitzer/287/orig -> origin/gh/soulitzer/287/orig 2025-12-04T08:56:59.4847044Z * [new branch] gh/soulitzer/296/base -> origin/gh/soulitzer/296/base 2025-12-04T08:56:59.4848625Z * [new branch] gh/soulitzer/296/head -> origin/gh/soulitzer/296/head 2025-12-04T08:56:59.4850299Z * [new branch] gh/soulitzer/296/orig -> origin/gh/soulitzer/296/orig 2025-12-04T08:56:59.4852413Z * [new branch] gh/soulitzer/299/base -> origin/gh/soulitzer/299/base 2025-12-04T08:56:59.4854020Z * [new branch] gh/soulitzer/299/head -> origin/gh/soulitzer/299/head 2025-12-04T08:56:59.4855543Z * [new branch] gh/soulitzer/299/orig -> origin/gh/soulitzer/299/orig 2025-12-04T08:56:59.4857663Z * [new branch] gh/soulitzer/300/base -> origin/gh/soulitzer/300/base 2025-12-04T08:56:59.4859278Z * [new branch] gh/soulitzer/300/head -> origin/gh/soulitzer/300/head 2025-12-04T08:56:59.4860903Z * [new branch] gh/soulitzer/300/orig -> origin/gh/soulitzer/300/orig 2025-12-04T08:56:59.4864842Z * [new branch] gh/soulitzer/301/base -> origin/gh/soulitzer/301/base 2025-12-04T08:56:59.4866501Z * [new branch] gh/soulitzer/301/head -> origin/gh/soulitzer/301/head 2025-12-04T08:56:59.4868037Z * [new branch] gh/soulitzer/301/orig -> origin/gh/soulitzer/301/orig 2025-12-04T08:56:59.4870120Z * [new branch] gh/soulitzer/313/base -> origin/gh/soulitzer/313/base 2025-12-04T08:56:59.4871716Z * [new branch] gh/soulitzer/313/head -> origin/gh/soulitzer/313/head 2025-12-04T08:56:59.4873169Z * [new branch] gh/soulitzer/313/orig -> origin/gh/soulitzer/313/orig 2025-12-04T08:56:59.4875416Z * [new branch] gh/soulitzer/319/base -> origin/gh/soulitzer/319/base 2025-12-04T08:56:59.4876927Z * [new branch] gh/soulitzer/319/head -> origin/gh/soulitzer/319/head 2025-12-04T08:56:59.4878503Z * [new branch] gh/soulitzer/319/orig -> origin/gh/soulitzer/319/orig 2025-12-04T08:56:59.4880635Z * [new branch] gh/soulitzer/320/base -> origin/gh/soulitzer/320/base 2025-12-04T08:56:59.4882548Z * [new branch] gh/soulitzer/320/head -> origin/gh/soulitzer/320/head 2025-12-04T08:56:59.4884125Z * [new branch] gh/soulitzer/320/orig -> origin/gh/soulitzer/320/orig 2025-12-04T08:56:59.4886376Z * [new branch] gh/soulitzer/336/base -> origin/gh/soulitzer/336/base 2025-12-04T08:56:59.4887817Z * [new branch] gh/soulitzer/336/head -> origin/gh/soulitzer/336/head 2025-12-04T08:56:59.4889330Z * [new branch] gh/soulitzer/336/orig -> origin/gh/soulitzer/336/orig 2025-12-04T08:56:59.4891474Z * [new branch] gh/soulitzer/347/base -> origin/gh/soulitzer/347/base 2025-12-04T08:56:59.4893340Z * [new branch] gh/soulitzer/347/head -> origin/gh/soulitzer/347/head 2025-12-04T08:56:59.4895178Z * [new branch] gh/soulitzer/347/orig -> origin/gh/soulitzer/347/orig 2025-12-04T08:56:59.4897399Z * [new branch] gh/soulitzer/349/base -> origin/gh/soulitzer/349/base 2025-12-04T08:56:59.4898963Z * [new branch] gh/soulitzer/349/head -> origin/gh/soulitzer/349/head 2025-12-04T08:56:59.4900559Z * [new branch] gh/soulitzer/349/orig -> origin/gh/soulitzer/349/orig 2025-12-04T08:56:59.4902568Z * [new branch] gh/soulitzer/350/base -> origin/gh/soulitzer/350/base 2025-12-04T08:56:59.4903978Z * [new branch] gh/soulitzer/350/head -> origin/gh/soulitzer/350/head 2025-12-04T08:56:59.4905554Z * [new branch] gh/soulitzer/350/orig -> origin/gh/soulitzer/350/orig 2025-12-04T08:56:59.4907723Z * [new branch] gh/soulitzer/351/base -> origin/gh/soulitzer/351/base 2025-12-04T08:56:59.4909263Z * [new branch] gh/soulitzer/351/head -> origin/gh/soulitzer/351/head 2025-12-04T08:56:59.4910810Z * [new branch] gh/soulitzer/351/orig -> origin/gh/soulitzer/351/orig 2025-12-04T08:56:59.4912917Z * [new branch] gh/soulitzer/353/base -> origin/gh/soulitzer/353/base 2025-12-04T08:56:59.4914568Z * [new branch] gh/soulitzer/353/head -> origin/gh/soulitzer/353/head 2025-12-04T08:56:59.4916087Z * [new branch] gh/soulitzer/353/orig -> origin/gh/soulitzer/353/orig 2025-12-04T08:56:59.4918963Z * [new branch] gh/soulitzer/358/base -> origin/gh/soulitzer/358/base 2025-12-04T08:56:59.4920612Z * [new branch] gh/soulitzer/358/head -> origin/gh/soulitzer/358/head 2025-12-04T08:56:59.4922200Z * [new branch] gh/soulitzer/358/orig -> origin/gh/soulitzer/358/orig 2025-12-04T08:56:59.4924852Z * [new branch] gh/soulitzer/359/base -> origin/gh/soulitzer/359/base 2025-12-04T08:56:59.4926373Z * [new branch] gh/soulitzer/359/head -> origin/gh/soulitzer/359/head 2025-12-04T08:56:59.4928096Z * [new branch] gh/soulitzer/359/orig -> origin/gh/soulitzer/359/orig 2025-12-04T08:56:59.4930241Z * [new branch] gh/soulitzer/374/base -> origin/gh/soulitzer/374/base 2025-12-04T08:56:59.4931752Z * [new branch] gh/soulitzer/374/head -> origin/gh/soulitzer/374/head 2025-12-04T08:56:59.4933471Z * [new branch] gh/soulitzer/374/orig -> origin/gh/soulitzer/374/orig 2025-12-04T08:56:59.4935563Z * [new branch] gh/soulitzer/375/base -> origin/gh/soulitzer/375/base 2025-12-04T08:56:59.4937046Z * [new branch] gh/soulitzer/375/head -> origin/gh/soulitzer/375/head 2025-12-04T08:56:59.4938720Z * [new branch] gh/soulitzer/375/orig -> origin/gh/soulitzer/375/orig 2025-12-04T08:56:59.4940794Z * [new branch] gh/soulitzer/380/base -> origin/gh/soulitzer/380/base 2025-12-04T08:56:59.4942292Z * [new branch] gh/soulitzer/380/head -> origin/gh/soulitzer/380/head 2025-12-04T08:56:59.4943786Z * [new branch] gh/soulitzer/380/orig -> origin/gh/soulitzer/380/orig 2025-12-04T08:56:59.4945923Z * [new branch] gh/soulitzer/385/base -> origin/gh/soulitzer/385/base 2025-12-04T08:56:59.4947479Z * [new branch] gh/soulitzer/385/head -> origin/gh/soulitzer/385/head 2025-12-04T08:56:59.4949138Z * [new branch] gh/soulitzer/385/orig -> origin/gh/soulitzer/385/orig 2025-12-04T08:56:59.4951239Z * [new branch] gh/soulitzer/386/base -> origin/gh/soulitzer/386/base 2025-12-04T08:56:59.4952813Z * [new branch] gh/soulitzer/386/head -> origin/gh/soulitzer/386/head 2025-12-04T08:56:59.4954382Z * [new branch] gh/soulitzer/386/orig -> origin/gh/soulitzer/386/orig 2025-12-04T08:56:59.4956535Z * [new branch] gh/soulitzer/387/base -> origin/gh/soulitzer/387/base 2025-12-04T08:56:59.4958020Z * [new branch] gh/soulitzer/387/head -> origin/gh/soulitzer/387/head 2025-12-04T08:56:59.4959530Z * [new branch] gh/soulitzer/387/orig -> origin/gh/soulitzer/387/orig 2025-12-04T08:56:59.4961726Z * [new branch] gh/soulitzer/388/base -> origin/gh/soulitzer/388/base 2025-12-04T08:56:59.4963535Z * [new branch] gh/soulitzer/388/head -> origin/gh/soulitzer/388/head 2025-12-04T08:56:59.4965265Z * [new branch] gh/soulitzer/388/orig -> origin/gh/soulitzer/388/orig 2025-12-04T08:56:59.4967385Z * [new branch] gh/soulitzer/389/base -> origin/gh/soulitzer/389/base 2025-12-04T08:56:59.4968893Z * [new branch] gh/soulitzer/389/head -> origin/gh/soulitzer/389/head 2025-12-04T08:56:59.4970550Z * [new branch] gh/soulitzer/389/orig -> origin/gh/soulitzer/389/orig 2025-12-04T08:56:59.4972710Z * [new branch] gh/soulitzer/390/base -> origin/gh/soulitzer/390/base 2025-12-04T08:56:59.4974292Z * [new branch] gh/soulitzer/390/head -> origin/gh/soulitzer/390/head 2025-12-04T08:56:59.4975856Z * [new branch] gh/soulitzer/390/orig -> origin/gh/soulitzer/390/orig 2025-12-04T08:56:59.4977953Z * [new branch] gh/soulitzer/391/base -> origin/gh/soulitzer/391/base 2025-12-04T08:56:59.4979538Z * [new branch] gh/soulitzer/391/head -> origin/gh/soulitzer/391/head 2025-12-04T08:56:59.4981050Z * [new branch] gh/soulitzer/391/orig -> origin/gh/soulitzer/391/orig 2025-12-04T08:56:59.4983208Z * [new branch] gh/soulitzer/392/base -> origin/gh/soulitzer/392/base 2025-12-04T08:56:59.4984721Z * [new branch] gh/soulitzer/392/head -> origin/gh/soulitzer/392/head 2025-12-04T08:56:59.4986271Z * [new branch] gh/soulitzer/392/orig -> origin/gh/soulitzer/392/orig 2025-12-04T08:56:59.4988879Z * [new branch] gh/swolchok/728/next -> origin/gh/swolchok/728/next 2025-12-04T08:56:59.4991212Z * [new branch] gh/swolchok/819/base -> origin/gh/swolchok/819/base 2025-12-04T08:56:59.4992883Z * [new branch] gh/swolchok/819/head -> origin/gh/swolchok/819/head 2025-12-04T08:56:59.4994400Z * [new branch] gh/swolchok/819/orig -> origin/gh/swolchok/819/orig 2025-12-04T08:56:59.4996464Z * [new branch] gh/swolchok/824/base -> origin/gh/swolchok/824/base 2025-12-04T08:56:59.4997986Z * [new branch] gh/swolchok/824/head -> origin/gh/swolchok/824/head 2025-12-04T08:56:59.4999529Z * [new branch] gh/swolchok/824/orig -> origin/gh/swolchok/824/orig 2025-12-04T08:56:59.5001860Z * [new branch] gh/swolchok/829/base -> origin/gh/swolchok/829/base 2025-12-04T08:56:59.5003027Z * [new branch] gh/swolchok/829/head -> origin/gh/swolchok/829/head 2025-12-04T08:56:59.5004665Z * [new branch] gh/swolchok/829/orig -> origin/gh/swolchok/829/orig 2025-12-04T08:56:59.5006744Z * [new branch] gh/swolchok/839/base -> origin/gh/swolchok/839/base 2025-12-04T08:56:59.5008264Z * [new branch] gh/swolchok/839/head -> origin/gh/swolchok/839/head 2025-12-04T08:56:59.5009781Z * [new branch] gh/swolchok/839/orig -> origin/gh/swolchok/839/orig 2025-12-04T08:56:59.5011959Z * [new branch] gh/swolchok/841/base -> origin/gh/swolchok/841/base 2025-12-04T08:56:59.5013655Z * [new branch] gh/swolchok/841/head -> origin/gh/swolchok/841/head 2025-12-04T08:56:59.5015204Z * [new branch] gh/swolchok/841/orig -> origin/gh/swolchok/841/orig 2025-12-04T08:56:59.5017266Z * [new branch] gh/swolchok/842/base -> origin/gh/swolchok/842/base 2025-12-04T08:56:59.5018831Z * [new branch] gh/swolchok/842/head -> origin/gh/swolchok/842/head 2025-12-04T08:56:59.5020366Z * [new branch] gh/swolchok/842/orig -> origin/gh/swolchok/842/orig 2025-12-04T08:56:59.5022420Z * [new branch] gh/swolchok/845/base -> origin/gh/swolchok/845/base 2025-12-04T08:56:59.5023930Z * [new branch] gh/swolchok/845/head -> origin/gh/swolchok/845/head 2025-12-04T08:56:59.5025631Z * [new branch] gh/swolchok/845/orig -> origin/gh/swolchok/845/orig 2025-12-04T08:56:59.5027779Z * [new branch] gh/swolchok/848/base -> origin/gh/swolchok/848/base 2025-12-04T08:56:59.5029385Z * [new branch] gh/swolchok/848/head -> origin/gh/swolchok/848/head 2025-12-04T08:56:59.5030935Z * [new branch] gh/swolchok/848/orig -> origin/gh/swolchok/848/orig 2025-12-04T08:56:59.5033105Z * [new branch] gh/swolchok/856/base -> origin/gh/swolchok/856/base 2025-12-04T08:56:59.5034643Z * [new branch] gh/swolchok/856/head -> origin/gh/swolchok/856/head 2025-12-04T08:56:59.5036176Z * [new branch] gh/swolchok/856/orig -> origin/gh/swolchok/856/orig 2025-12-04T08:56:59.5038364Z * [new branch] gh/swolchok/860/base -> origin/gh/swolchok/860/base 2025-12-04T08:56:59.5039972Z * [new branch] gh/swolchok/860/head -> origin/gh/swolchok/860/head 2025-12-04T08:56:59.5041446Z * [new branch] gh/swolchok/860/orig -> origin/gh/swolchok/860/orig 2025-12-04T08:56:59.5043714Z * [new branch] gh/swolchok/861/base -> origin/gh/swolchok/861/base 2025-12-04T08:56:59.5045368Z * [new branch] gh/swolchok/861/head -> origin/gh/swolchok/861/head 2025-12-04T08:56:59.5046953Z * [new branch] gh/swolchok/861/orig -> origin/gh/swolchok/861/orig 2025-12-04T08:56:59.5049111Z * [new branch] gh/swolchok/862/base -> origin/gh/swolchok/862/base 2025-12-04T08:56:59.5050633Z * [new branch] gh/swolchok/862/head -> origin/gh/swolchok/862/head 2025-12-04T08:56:59.5052132Z * [new branch] gh/swolchok/862/orig -> origin/gh/swolchok/862/orig 2025-12-04T08:56:59.5055066Z * [new branch] gh/swolchok/863/base -> origin/gh/swolchok/863/base 2025-12-04T08:56:59.5056801Z * [new branch] gh/swolchok/863/head -> origin/gh/swolchok/863/head 2025-12-04T08:56:59.5058437Z * [new branch] gh/swolchok/863/orig -> origin/gh/swolchok/863/orig 2025-12-04T08:56:59.5060605Z * [new branch] gh/swolchok/864/base -> origin/gh/swolchok/864/base 2025-12-04T08:56:59.5062200Z * [new branch] gh/swolchok/864/head -> origin/gh/swolchok/864/head 2025-12-04T08:56:59.5064056Z * [new branch] gh/swolchok/864/orig -> origin/gh/swolchok/864/orig 2025-12-04T08:56:59.5066057Z * [new branch] gh/swolchok/865/base -> origin/gh/swolchok/865/base 2025-12-04T08:56:59.5067752Z * [new branch] gh/swolchok/865/head -> origin/gh/swolchok/865/head 2025-12-04T08:56:59.5069328Z * [new branch] gh/swolchok/865/orig -> origin/gh/swolchok/865/orig 2025-12-04T08:56:59.5071967Z * [new branch] gh/swolchok/866/base -> origin/gh/swolchok/866/base 2025-12-04T08:56:59.5073574Z * [new branch] gh/swolchok/866/head -> origin/gh/swolchok/866/head 2025-12-04T08:56:59.5075258Z * [new branch] gh/swolchok/866/orig -> origin/gh/swolchok/866/orig 2025-12-04T08:56:59.5077457Z * [new branch] gh/swolchok/867/base -> origin/gh/swolchok/867/base 2025-12-04T08:56:59.5079009Z * [new branch] gh/swolchok/867/head -> origin/gh/swolchok/867/head 2025-12-04T08:56:59.5080552Z * [new branch] gh/swolchok/867/orig -> origin/gh/swolchok/867/orig 2025-12-04T08:56:59.5082674Z * [new branch] gh/swolchok/868/base -> origin/gh/swolchok/868/base 2025-12-04T08:56:59.5084210Z * [new branch] gh/swolchok/868/head -> origin/gh/swolchok/868/head 2025-12-04T08:56:59.5085745Z * [new branch] gh/swolchok/868/orig -> origin/gh/swolchok/868/orig 2025-12-04T08:56:59.5087869Z * [new branch] gh/swolchok/869/base -> origin/gh/swolchok/869/base 2025-12-04T08:56:59.5089491Z * [new branch] gh/swolchok/869/head -> origin/gh/swolchok/869/head 2025-12-04T08:56:59.5091110Z * [new branch] gh/swolchok/869/orig -> origin/gh/swolchok/869/orig 2025-12-04T08:56:59.5093533Z * [new branch] gh/swolchok/870/base -> origin/gh/swolchok/870/base 2025-12-04T08:56:59.5094963Z * [new branch] gh/swolchok/870/head -> origin/gh/swolchok/870/head 2025-12-04T08:56:59.5096476Z * [new branch] gh/swolchok/870/orig -> origin/gh/swolchok/870/orig 2025-12-04T08:56:59.5098746Z * [new branch] gh/swolchok/871/base -> origin/gh/swolchok/871/base 2025-12-04T08:56:59.5100321Z * [new branch] gh/swolchok/871/head -> origin/gh/swolchok/871/head 2025-12-04T08:56:59.5101911Z * [new branch] gh/swolchok/871/orig -> origin/gh/swolchok/871/orig 2025-12-04T08:56:59.5104658Z * [new branch] gh/teja-rao/4/base -> origin/gh/teja-rao/4/base 2025-12-04T08:56:59.5106289Z * [new branch] gh/teja-rao/4/head -> origin/gh/teja-rao/4/head 2025-12-04T08:56:59.5107840Z * [new branch] gh/teja-rao/4/orig -> origin/gh/teja-rao/4/orig 2025-12-04T08:56:59.5110364Z * [new branch] gh/tianyu-l/2/base -> origin/gh/tianyu-l/2/base 2025-12-04T08:56:59.5111965Z * [new branch] gh/tianyu-l/2/head -> origin/gh/tianyu-l/2/head 2025-12-04T08:56:59.5113503Z * [new branch] gh/tianyu-l/2/orig -> origin/gh/tianyu-l/2/orig 2025-12-04T08:56:59.5115525Z * [new branch] gh/tianyu-l/3/base -> origin/gh/tianyu-l/3/base 2025-12-04T08:56:59.5117159Z * [new branch] gh/tianyu-l/3/orig -> origin/gh/tianyu-l/3/orig 2025-12-04T08:56:59.5119181Z * [new branch] gh/tianyu-l/4/base -> origin/gh/tianyu-l/4/base 2025-12-04T08:56:59.5120687Z * [new branch] gh/tianyu-l/4/head -> origin/gh/tianyu-l/4/head 2025-12-04T08:56:59.5122331Z * [new branch] gh/tianyu-l/4/orig -> origin/gh/tianyu-l/4/orig 2025-12-04T08:56:59.5125168Z * [new branch] gh/tugsbayasgalan/10/base -> origin/gh/tugsbayasgalan/10/base 2025-12-04T08:56:59.5126788Z * [new branch] gh/tugsbayasgalan/10/head -> origin/gh/tugsbayasgalan/10/head 2025-12-04T08:56:59.5128562Z * [new branch] gh/tugsbayasgalan/10/orig -> origin/gh/tugsbayasgalan/10/orig 2025-12-04T08:56:59.5130591Z * [new branch] gh/tugsbayasgalan/13/base -> origin/gh/tugsbayasgalan/13/base 2025-12-04T08:56:59.5132118Z * [new branch] gh/tugsbayasgalan/13/head -> origin/gh/tugsbayasgalan/13/head 2025-12-04T08:56:59.5133851Z * [new branch] gh/tugsbayasgalan/13/orig -> origin/gh/tugsbayasgalan/13/orig 2025-12-04T08:56:59.5136023Z * [new branch] gh/tugsbayasgalan/17/base -> origin/gh/tugsbayasgalan/17/base 2025-12-04T08:56:59.5137585Z * [new branch] gh/tugsbayasgalan/17/head -> origin/gh/tugsbayasgalan/17/head 2025-12-04T08:56:59.5139112Z * [new branch] gh/tugsbayasgalan/17/orig -> origin/gh/tugsbayasgalan/17/orig 2025-12-04T08:56:59.5141282Z * [new branch] gh/tugsbayasgalan/2/base -> origin/gh/tugsbayasgalan/2/base 2025-12-04T08:56:59.5142860Z * [new branch] gh/tugsbayasgalan/2/head -> origin/gh/tugsbayasgalan/2/head 2025-12-04T08:56:59.5144432Z * [new branch] gh/tugsbayasgalan/2/orig -> origin/gh/tugsbayasgalan/2/orig 2025-12-04T08:56:59.5146679Z * [new branch] gh/tugsbayasgalan/28/base -> origin/gh/tugsbayasgalan/28/base 2025-12-04T08:56:59.5148279Z * [new branch] gh/tugsbayasgalan/28/head -> origin/gh/tugsbayasgalan/28/head 2025-12-04T08:56:59.5149838Z * [new branch] gh/tugsbayasgalan/28/orig -> origin/gh/tugsbayasgalan/28/orig 2025-12-04T08:56:59.5151926Z * [new branch] gh/tugsbayasgalan/32/base -> origin/gh/tugsbayasgalan/32/base 2025-12-04T08:56:59.5153498Z * [new branch] gh/tugsbayasgalan/32/head -> origin/gh/tugsbayasgalan/32/head 2025-12-04T08:56:59.5155030Z * [new branch] gh/tugsbayasgalan/32/orig -> origin/gh/tugsbayasgalan/32/orig 2025-12-04T08:56:59.5157162Z * [new branch] gh/tugsbayasgalan/35/base -> origin/gh/tugsbayasgalan/35/base 2025-12-04T08:56:59.5158815Z * [new branch] gh/tugsbayasgalan/35/head -> origin/gh/tugsbayasgalan/35/head 2025-12-04T08:56:59.5160393Z * [new branch] gh/tugsbayasgalan/35/orig -> origin/gh/tugsbayasgalan/35/orig 2025-12-04T08:56:59.5162472Z * [new branch] gh/tugsbayasgalan/36/base -> origin/gh/tugsbayasgalan/36/base 2025-12-04T08:56:59.5164174Z * [new branch] gh/tugsbayasgalan/36/head -> origin/gh/tugsbayasgalan/36/head 2025-12-04T08:56:59.5165701Z * [new branch] gh/tugsbayasgalan/36/orig -> origin/gh/tugsbayasgalan/36/orig 2025-12-04T08:56:59.5168232Z * [new branch] gh/tugsbayasgalan/37/base -> origin/gh/tugsbayasgalan/37/base 2025-12-04T08:56:59.5169806Z * [new branch] gh/tugsbayasgalan/37/head -> origin/gh/tugsbayasgalan/37/head 2025-12-04T08:56:59.5171372Z * [new branch] gh/tugsbayasgalan/37/orig -> origin/gh/tugsbayasgalan/37/orig 2025-12-04T08:56:59.5173653Z * [new branch] gh/tugsbayasgalan/43/base -> origin/gh/tugsbayasgalan/43/base 2025-12-04T08:56:59.5175145Z * [new branch] gh/tugsbayasgalan/43/head -> origin/gh/tugsbayasgalan/43/head 2025-12-04T08:56:59.5176785Z * [new branch] gh/tugsbayasgalan/43/orig -> origin/gh/tugsbayasgalan/43/orig 2025-12-04T08:56:59.5178749Z * [new branch] gh/tugsbayasgalan/48/base -> origin/gh/tugsbayasgalan/48/base 2025-12-04T08:56:59.5180403Z * [new branch] gh/tugsbayasgalan/48/head -> origin/gh/tugsbayasgalan/48/head 2025-12-04T08:56:59.5181927Z * [new branch] gh/tugsbayasgalan/48/orig -> origin/gh/tugsbayasgalan/48/orig 2025-12-04T08:56:59.5183956Z * [new branch] gh/tugsbayasgalan/51/base -> origin/gh/tugsbayasgalan/51/base 2025-12-04T08:56:59.5185505Z * [new branch] gh/tugsbayasgalan/51/head -> origin/gh/tugsbayasgalan/51/head 2025-12-04T08:56:59.5187172Z * [new branch] gh/tugsbayasgalan/51/orig -> origin/gh/tugsbayasgalan/51/orig 2025-12-04T08:56:59.5189040Z * [new branch] gh/tugsbayasgalan/52/base -> origin/gh/tugsbayasgalan/52/base 2025-12-04T08:56:59.5190570Z * [new branch] gh/tugsbayasgalan/52/head -> origin/gh/tugsbayasgalan/52/head 2025-12-04T08:56:59.5192145Z * [new branch] gh/tugsbayasgalan/52/orig -> origin/gh/tugsbayasgalan/52/orig 2025-12-04T08:56:59.5194318Z * [new branch] gh/tugsbayasgalan/53/base -> origin/gh/tugsbayasgalan/53/base 2025-12-04T08:56:59.5195834Z * [new branch] gh/tugsbayasgalan/53/head -> origin/gh/tugsbayasgalan/53/head 2025-12-04T08:56:59.5197376Z * [new branch] gh/tugsbayasgalan/53/orig -> origin/gh/tugsbayasgalan/53/orig 2025-12-04T08:56:59.5199590Z * [new branch] gh/tugsbayasgalan/55/base -> origin/gh/tugsbayasgalan/55/base 2025-12-04T08:56:59.5201277Z * [new branch] gh/tugsbayasgalan/55/head -> origin/gh/tugsbayasgalan/55/head 2025-12-04T08:56:59.5202969Z * [new branch] gh/tugsbayasgalan/55/orig -> origin/gh/tugsbayasgalan/55/orig 2025-12-04T08:56:59.5205111Z * [new branch] gh/tugsbayasgalan/59/base -> origin/gh/tugsbayasgalan/59/base 2025-12-04T08:56:59.5206721Z * [new branch] gh/tugsbayasgalan/59/head -> origin/gh/tugsbayasgalan/59/head 2025-12-04T08:56:59.5208482Z * [new branch] gh/tugsbayasgalan/59/orig -> origin/gh/tugsbayasgalan/59/orig 2025-12-04T08:56:59.5210628Z * [new branch] gh/tugsbayasgalan/6/base -> origin/gh/tugsbayasgalan/6/base 2025-12-04T08:56:59.5212128Z * [new branch] gh/tugsbayasgalan/6/head -> origin/gh/tugsbayasgalan/6/head 2025-12-04T08:56:59.5213842Z * [new branch] gh/tugsbayasgalan/6/orig -> origin/gh/tugsbayasgalan/6/orig 2025-12-04T08:56:59.5215757Z * [new branch] gh/tugsbayasgalan/60/base -> origin/gh/tugsbayasgalan/60/base 2025-12-04T08:56:59.5217326Z * [new branch] gh/tugsbayasgalan/60/head -> origin/gh/tugsbayasgalan/60/head 2025-12-04T08:56:59.5218840Z * [new branch] gh/tugsbayasgalan/60/orig -> origin/gh/tugsbayasgalan/60/orig 2025-12-04T08:56:59.5221402Z * [new branch] gh/tugsbayasgalan/61/base -> origin/gh/tugsbayasgalan/61/base 2025-12-04T08:56:59.5222999Z * [new branch] gh/tugsbayasgalan/61/head -> origin/gh/tugsbayasgalan/61/head 2025-12-04T08:56:59.5224554Z * [new branch] gh/tugsbayasgalan/61/orig -> origin/gh/tugsbayasgalan/61/orig 2025-12-04T08:56:59.5226916Z * [new branch] gh/tugsbayasgalan/63/base -> origin/gh/tugsbayasgalan/63/base 2025-12-04T08:56:59.5228555Z * [new branch] gh/tugsbayasgalan/63/head -> origin/gh/tugsbayasgalan/63/head 2025-12-04T08:56:59.5230066Z * [new branch] gh/tugsbayasgalan/63/orig -> origin/gh/tugsbayasgalan/63/orig 2025-12-04T08:56:59.5232287Z * [new branch] gh/tugsbayasgalan/67/base -> origin/gh/tugsbayasgalan/67/base 2025-12-04T08:56:59.5234012Z * [new branch] gh/tugsbayasgalan/67/head -> origin/gh/tugsbayasgalan/67/head 2025-12-04T08:56:59.5235325Z * [new branch] gh/tugsbayasgalan/67/orig -> origin/gh/tugsbayasgalan/67/orig 2025-12-04T08:56:59.5250280Z * [new branch] gh/tugsbayasgalan/68/base -> origin/gh/tugsbayasgalan/68/base 2025-12-04T08:56:59.5250521Z * [new branch] gh/tugsbayasgalan/68/head -> origin/gh/tugsbayasgalan/68/head 2025-12-04T08:56:59.5250709Z * [new branch] gh/tugsbayasgalan/68/orig -> origin/gh/tugsbayasgalan/68/orig 2025-12-04T08:56:59.5250890Z * [new branch] gh/tugsbayasgalan/7/base -> origin/gh/tugsbayasgalan/7/base 2025-12-04T08:56:59.5251054Z * [new branch] gh/tugsbayasgalan/7/head -> origin/gh/tugsbayasgalan/7/head 2025-12-04T08:56:59.5251218Z * [new branch] gh/tugsbayasgalan/7/orig -> origin/gh/tugsbayasgalan/7/orig 2025-12-04T08:56:59.5251515Z * [new branch] gh/tugsbayasgalan/70/base -> origin/gh/tugsbayasgalan/70/base 2025-12-04T08:56:59.5251768Z * [new branch] gh/tugsbayasgalan/70/head -> origin/gh/tugsbayasgalan/70/head 2025-12-04T08:56:59.5251975Z * [new branch] gh/tugsbayasgalan/70/orig -> origin/gh/tugsbayasgalan/70/orig 2025-12-04T08:56:59.5254563Z * [new branch] gh/tugsbayasgalan/71/base -> origin/gh/tugsbayasgalan/71/base 2025-12-04T08:56:59.5256212Z * [new branch] gh/tugsbayasgalan/71/head -> origin/gh/tugsbayasgalan/71/head 2025-12-04T08:56:59.5257793Z * [new branch] gh/tugsbayasgalan/71/orig -> origin/gh/tugsbayasgalan/71/orig 2025-12-04T08:56:59.5260158Z * [new branch] gh/tugsbayasgalan/72/base -> origin/gh/tugsbayasgalan/72/base 2025-12-04T08:56:59.5261680Z * [new branch] gh/tugsbayasgalan/72/head -> origin/gh/tugsbayasgalan/72/head 2025-12-04T08:56:59.5264895Z * [new branch] gh/tugsbayasgalan/72/orig -> origin/gh/tugsbayasgalan/72/orig 2025-12-04T08:56:59.5266999Z * [new branch] gh/tugsbayasgalan/73/base -> origin/gh/tugsbayasgalan/73/base 2025-12-04T08:56:59.5268742Z * [new branch] gh/tugsbayasgalan/73/head -> origin/gh/tugsbayasgalan/73/head 2025-12-04T08:56:59.5270757Z * [new branch] gh/tugsbayasgalan/73/orig -> origin/gh/tugsbayasgalan/73/orig 2025-12-04T08:56:59.5273198Z * [new branch] gh/tugsbayasgalan/74/base -> origin/gh/tugsbayasgalan/74/base 2025-12-04T08:56:59.5274781Z * [new branch] gh/tugsbayasgalan/74/head -> origin/gh/tugsbayasgalan/74/head 2025-12-04T08:56:59.5276474Z * [new branch] gh/tugsbayasgalan/74/orig -> origin/gh/tugsbayasgalan/74/orig 2025-12-04T08:56:59.5278619Z * [new branch] gh/tugsbayasgalan/75/base -> origin/gh/tugsbayasgalan/75/base 2025-12-04T08:56:59.5280163Z * [new branch] gh/tugsbayasgalan/75/head -> origin/gh/tugsbayasgalan/75/head 2025-12-04T08:56:59.5281682Z * [new branch] gh/tugsbayasgalan/75/orig -> origin/gh/tugsbayasgalan/75/orig 2025-12-04T08:56:59.5283620Z * [new branch] gh/tugsbayasgalan/76/base -> origin/gh/tugsbayasgalan/76/base 2025-12-04T08:56:59.5285152Z * [new branch] gh/tugsbayasgalan/76/head -> origin/gh/tugsbayasgalan/76/head 2025-12-04T08:56:59.5287095Z * [new branch] gh/tugsbayasgalan/76/orig -> origin/gh/tugsbayasgalan/76/orig 2025-12-04T08:56:59.5289402Z * [new branch] gh/tugsbayasgalan/77/base -> origin/gh/tugsbayasgalan/77/base 2025-12-04T08:56:59.5291218Z * [new branch] gh/tugsbayasgalan/77/head -> origin/gh/tugsbayasgalan/77/head 2025-12-04T08:56:59.5292957Z * [new branch] gh/tugsbayasgalan/77/orig -> origin/gh/tugsbayasgalan/77/orig 2025-12-04T08:56:59.5295106Z * [new branch] gh/tugsbayasgalan/78/base -> origin/gh/tugsbayasgalan/78/base 2025-12-04T08:56:59.5296754Z * [new branch] gh/tugsbayasgalan/78/head -> origin/gh/tugsbayasgalan/78/head 2025-12-04T08:56:59.5298317Z * [new branch] gh/tugsbayasgalan/78/orig -> origin/gh/tugsbayasgalan/78/orig 2025-12-04T08:56:59.5300442Z * [new branch] gh/tugsbayasgalan/79/base -> origin/gh/tugsbayasgalan/79/base 2025-12-04T08:56:59.5301988Z * [new branch] gh/tugsbayasgalan/79/head -> origin/gh/tugsbayasgalan/79/head 2025-12-04T08:56:59.5303463Z * [new branch] gh/tugsbayasgalan/79/orig -> origin/gh/tugsbayasgalan/79/orig 2025-12-04T08:56:59.5305651Z * [new branch] gh/tugsbayasgalan/8/base -> origin/gh/tugsbayasgalan/8/base 2025-12-04T08:56:59.5307193Z * [new branch] gh/tugsbayasgalan/8/head -> origin/gh/tugsbayasgalan/8/head 2025-12-04T08:56:59.5308786Z * [new branch] gh/tugsbayasgalan/8/orig -> origin/gh/tugsbayasgalan/8/orig 2025-12-04T08:56:59.5311031Z * [new branch] gh/tugsbayasgalan/80/base -> origin/gh/tugsbayasgalan/80/base 2025-12-04T08:56:59.5312498Z * [new branch] gh/tugsbayasgalan/80/head -> origin/gh/tugsbayasgalan/80/head 2025-12-04T08:56:59.5313985Z * [new branch] gh/tugsbayasgalan/80/orig -> origin/gh/tugsbayasgalan/80/orig 2025-12-04T08:56:59.5316189Z * [new branch] gh/tugsbayasgalan/81/base -> origin/gh/tugsbayasgalan/81/base 2025-12-04T08:56:59.5317653Z * [new branch] gh/tugsbayasgalan/81/head -> origin/gh/tugsbayasgalan/81/head 2025-12-04T08:56:59.5319223Z * [new branch] gh/tugsbayasgalan/81/orig -> origin/gh/tugsbayasgalan/81/orig 2025-12-04T08:56:59.5321796Z * [new branch] gh/tugsbayasgalan/82/base -> origin/gh/tugsbayasgalan/82/base 2025-12-04T08:56:59.5323575Z * [new branch] gh/tugsbayasgalan/82/head -> origin/gh/tugsbayasgalan/82/head 2025-12-04T08:56:59.5325185Z * [new branch] gh/tugsbayasgalan/82/orig -> origin/gh/tugsbayasgalan/82/orig 2025-12-04T08:56:59.5327208Z * [new branch] gh/tugsbayasgalan/83/base -> origin/gh/tugsbayasgalan/83/base 2025-12-04T08:56:59.5328737Z * [new branch] gh/tugsbayasgalan/83/head -> origin/gh/tugsbayasgalan/83/head 2025-12-04T08:56:59.5330283Z * [new branch] gh/tugsbayasgalan/83/orig -> origin/gh/tugsbayasgalan/83/orig 2025-12-04T08:56:59.5332350Z * [new branch] gh/tugsbayasgalan/84/base -> origin/gh/tugsbayasgalan/84/base 2025-12-04T08:56:59.5334139Z * [new branch] gh/tugsbayasgalan/84/head -> origin/gh/tugsbayasgalan/84/head 2025-12-04T08:56:59.5335645Z * [new branch] gh/tugsbayasgalan/84/orig -> origin/gh/tugsbayasgalan/84/orig 2025-12-04T08:56:59.5337666Z * [new branch] gh/tugsbayasgalan/85/base -> origin/gh/tugsbayasgalan/85/base 2025-12-04T08:56:59.5339276Z * [new branch] gh/tugsbayasgalan/85/head -> origin/gh/tugsbayasgalan/85/head 2025-12-04T08:56:59.5340808Z * [new branch] gh/tugsbayasgalan/85/orig -> origin/gh/tugsbayasgalan/85/orig 2025-12-04T08:56:59.5342978Z * [new branch] gh/tugsbayasgalan/86/base -> origin/gh/tugsbayasgalan/86/base 2025-12-04T08:56:59.5344531Z * [new branch] gh/tugsbayasgalan/86/head -> origin/gh/tugsbayasgalan/86/head 2025-12-04T08:56:59.5346085Z * [new branch] gh/tugsbayasgalan/86/orig -> origin/gh/tugsbayasgalan/86/orig 2025-12-04T08:56:59.5348511Z * [new branch] gh/tugsbayasgalan/87/base -> origin/gh/tugsbayasgalan/87/base 2025-12-04T08:56:59.5350096Z * [new branch] gh/tugsbayasgalan/87/head -> origin/gh/tugsbayasgalan/87/head 2025-12-04T08:56:59.5351639Z * [new branch] gh/tugsbayasgalan/87/orig -> origin/gh/tugsbayasgalan/87/orig 2025-12-04T08:56:59.5353935Z * [new branch] gh/tugsbayasgalan/88/base -> origin/gh/tugsbayasgalan/88/base 2025-12-04T08:56:59.5355495Z * [new branch] gh/tugsbayasgalan/88/head -> origin/gh/tugsbayasgalan/88/head 2025-12-04T08:56:59.5357022Z * [new branch] gh/tugsbayasgalan/88/orig -> origin/gh/tugsbayasgalan/88/orig 2025-12-04T08:56:59.5359206Z * [new branch] gh/tugsbayasgalan/89/base -> origin/gh/tugsbayasgalan/89/base 2025-12-04T08:56:59.5360797Z * [new branch] gh/tugsbayasgalan/89/head -> origin/gh/tugsbayasgalan/89/head 2025-12-04T08:56:59.5362526Z * [new branch] gh/tugsbayasgalan/89/orig -> origin/gh/tugsbayasgalan/89/orig 2025-12-04T08:56:59.5364615Z * [new branch] gh/tugsbayasgalan/9/base -> origin/gh/tugsbayasgalan/9/base 2025-12-04T08:56:59.5366012Z * [new branch] gh/tugsbayasgalan/9/head -> origin/gh/tugsbayasgalan/9/head 2025-12-04T08:56:59.5367629Z * [new branch] gh/tugsbayasgalan/9/orig -> origin/gh/tugsbayasgalan/9/orig 2025-12-04T08:56:59.5369897Z * [new branch] gh/tugsbayasgalan/90/base -> origin/gh/tugsbayasgalan/90/base 2025-12-04T08:56:59.5371523Z * [new branch] gh/tugsbayasgalan/90/head -> origin/gh/tugsbayasgalan/90/head 2025-12-04T08:56:59.5373098Z * [new branch] gh/tugsbayasgalan/90/orig -> origin/gh/tugsbayasgalan/90/orig 2025-12-04T08:56:59.5375896Z * [new branch] gh/tugsbayasgalan/91/base -> origin/gh/tugsbayasgalan/91/base 2025-12-04T08:56:59.5377343Z * [new branch] gh/tugsbayasgalan/91/head -> origin/gh/tugsbayasgalan/91/head 2025-12-04T08:56:59.5378945Z * [new branch] gh/tugsbayasgalan/91/orig -> origin/gh/tugsbayasgalan/91/orig 2025-12-04T08:56:59.5381109Z * [new branch] gh/tugsbayasgalan/92/base -> origin/gh/tugsbayasgalan/92/base 2025-12-04T08:56:59.5382687Z * [new branch] gh/tugsbayasgalan/92/head -> origin/gh/tugsbayasgalan/92/head 2025-12-04T08:56:59.5384274Z * [new branch] gh/tugsbayasgalan/92/orig -> origin/gh/tugsbayasgalan/92/orig 2025-12-04T08:56:59.5386539Z * [new branch] gh/tugsbayasgalan/93/base -> origin/gh/tugsbayasgalan/93/base 2025-12-04T08:56:59.5388168Z * [new branch] gh/tugsbayasgalan/93/head -> origin/gh/tugsbayasgalan/93/head 2025-12-04T08:56:59.5389753Z * [new branch] gh/tugsbayasgalan/93/orig -> origin/gh/tugsbayasgalan/93/orig 2025-12-04T08:56:59.5392278Z * [new branch] gh/v0i0/14/base -> origin/gh/v0i0/14/base 2025-12-04T08:56:59.5393788Z * [new branch] gh/v0i0/14/head -> origin/gh/v0i0/14/head 2025-12-04T08:56:59.5395334Z * [new branch] gh/v0i0/14/orig -> origin/gh/v0i0/14/orig 2025-12-04T08:56:59.5397362Z * [new branch] gh/v0i0/15/base -> origin/gh/v0i0/15/base 2025-12-04T08:56:59.5398974Z * [new branch] gh/v0i0/15/head -> origin/gh/v0i0/15/head 2025-12-04T08:56:59.5400511Z * [new branch] gh/v0i0/15/orig -> origin/gh/v0i0/15/orig 2025-12-04T08:56:59.5402603Z * [new branch] gh/v0i0/16/base -> origin/gh/v0i0/16/base 2025-12-04T08:56:59.5404207Z * [new branch] gh/v0i0/16/head -> origin/gh/v0i0/16/head 2025-12-04T08:56:59.5405704Z * [new branch] gh/v0i0/16/orig -> origin/gh/v0i0/16/orig 2025-12-04T08:56:59.5407749Z * [new branch] gh/v0i0/17/base -> origin/gh/v0i0/17/base 2025-12-04T08:56:59.5409323Z * [new branch] gh/v0i0/17/head -> origin/gh/v0i0/17/head 2025-12-04T08:56:59.5410836Z * [new branch] gh/v0i0/17/orig -> origin/gh/v0i0/17/orig 2025-12-04T08:56:59.5413630Z * [new branch] gh/v0i0/18/base -> origin/gh/v0i0/18/base 2025-12-04T08:56:59.5415184Z * [new branch] gh/v0i0/18/head -> origin/gh/v0i0/18/head 2025-12-04T08:56:59.5416734Z * [new branch] gh/v0i0/18/orig -> origin/gh/v0i0/18/orig 2025-12-04T08:56:59.5419332Z * [new branch] gh/v0i0/19/base -> origin/gh/v0i0/19/base 2025-12-04T08:56:59.5420878Z * [new branch] gh/v0i0/19/head -> origin/gh/v0i0/19/head 2025-12-04T08:56:59.5422531Z * [new branch] gh/v0i0/19/orig -> origin/gh/v0i0/19/orig 2025-12-04T08:56:59.5425163Z * [new branch] gh/vishal9-team/1/base -> origin/gh/vishal9-team/1/base 2025-12-04T08:56:59.5426756Z * [new branch] gh/vishal9-team/1/head -> origin/gh/vishal9-team/1/head 2025-12-04T08:56:59.5428728Z * [new branch] gh/vishal9-team/2/base -> origin/gh/vishal9-team/2/base 2025-12-04T08:56:59.5430279Z * [new branch] gh/vishal9-team/2/head -> origin/gh/vishal9-team/2/head 2025-12-04T08:56:59.5431833Z * [new branch] gh/vishal9-team/2/orig -> origin/gh/vishal9-team/2/orig 2025-12-04T08:56:59.5433832Z * [new branch] gh/vishal9-team/3/base -> origin/gh/vishal9-team/3/base 2025-12-04T08:56:59.5435905Z * [new branch] gh/vishal9-team/3/head -> origin/gh/vishal9-team/3/head 2025-12-04T08:56:59.5437426Z * [new branch] gh/vishal9-team/3/orig -> origin/gh/vishal9-team/3/orig 2025-12-04T08:56:59.5439475Z * [new branch] gh/vishal9-team/4/base -> origin/gh/vishal9-team/4/base 2025-12-04T08:56:59.5441009Z * [new branch] gh/vishal9-team/4/head -> origin/gh/vishal9-team/4/head 2025-12-04T08:56:59.5442602Z * [new branch] gh/vishal9-team/4/orig -> origin/gh/vishal9-team/4/orig 2025-12-04T08:56:59.5445116Z * [new branch] gh/vkuzo/1/next -> origin/gh/vkuzo/1/next 2025-12-04T08:56:59.5447195Z * [new branch] gh/vkuzo/2/next -> origin/gh/vkuzo/2/next 2025-12-04T08:56:59.5449212Z * [new branch] gh/vkuzo/3/next -> origin/gh/vkuzo/3/next 2025-12-04T08:56:59.5451719Z * [new branch] gh/wconstab/424/base -> origin/gh/wconstab/424/base 2025-12-04T08:56:59.5453451Z * [new branch] gh/wconstab/424/head -> origin/gh/wconstab/424/head 2025-12-04T08:56:59.5455359Z * [new branch] gh/wconstab/424/orig -> origin/gh/wconstab/424/orig 2025-12-04T08:56:59.5457549Z * [new branch] gh/wconstab/435/base -> origin/gh/wconstab/435/base 2025-12-04T08:56:59.5459224Z * [new branch] gh/wconstab/435/head -> origin/gh/wconstab/435/head 2025-12-04T08:56:59.5460569Z * [new branch] gh/wconstab/435/orig -> origin/gh/wconstab/435/orig 2025-12-04T08:56:59.5462829Z * [new branch] gh/wconstab/444/base -> origin/gh/wconstab/444/base 2025-12-04T08:56:59.5464358Z * [new branch] gh/wconstab/444/head -> origin/gh/wconstab/444/head 2025-12-04T08:56:59.5465910Z * [new branch] gh/wconstab/444/orig -> origin/gh/wconstab/444/orig 2025-12-04T08:56:59.5468468Z * [new branch] gh/wconstab/447/base -> origin/gh/wconstab/447/base 2025-12-04T08:56:59.5470103Z * [new branch] gh/wconstab/447/head -> origin/gh/wconstab/447/head 2025-12-04T08:56:59.5471666Z * [new branch] gh/wconstab/447/orig -> origin/gh/wconstab/447/orig 2025-12-04T08:56:59.5473687Z * [new branch] gh/wconstab/448/base -> origin/gh/wconstab/448/base 2025-12-04T08:56:59.5475283Z * [new branch] gh/wconstab/448/head -> origin/gh/wconstab/448/head 2025-12-04T08:56:59.5476866Z * [new branch] gh/wconstab/448/orig -> origin/gh/wconstab/448/orig 2025-12-04T08:56:59.5479032Z * [new branch] gh/wconstab/449/base -> origin/gh/wconstab/449/base 2025-12-04T08:56:59.5480635Z * [new branch] gh/wconstab/449/head -> origin/gh/wconstab/449/head 2025-12-04T08:56:59.5482201Z * [new branch] gh/wconstab/449/orig -> origin/gh/wconstab/449/orig 2025-12-04T08:56:59.5484188Z * [new branch] gh/wconstab/450/base -> origin/gh/wconstab/450/base 2025-12-04T08:56:59.5485786Z * [new branch] gh/wconstab/450/head -> origin/gh/wconstab/450/head 2025-12-04T08:56:59.5487370Z * [new branch] gh/wconstab/450/orig -> origin/gh/wconstab/450/orig 2025-12-04T08:56:59.5489252Z * [new branch] gh/wconstab/451/base -> origin/gh/wconstab/451/base 2025-12-04T08:56:59.5490878Z * [new branch] gh/wconstab/451/head -> origin/gh/wconstab/451/head 2025-12-04T08:56:59.5492489Z * [new branch] gh/wconstab/451/orig -> origin/gh/wconstab/451/orig 2025-12-04T08:56:59.5494668Z * [new branch] gh/wconstab/452/base -> origin/gh/wconstab/452/base 2025-12-04T08:56:59.5496187Z * [new branch] gh/wconstab/452/head -> origin/gh/wconstab/452/head 2025-12-04T08:56:59.5497724Z * [new branch] gh/wconstab/452/orig -> origin/gh/wconstab/452/orig 2025-12-04T08:56:59.5499850Z * [new branch] gh/wconstab/453/base -> origin/gh/wconstab/453/base 2025-12-04T08:56:59.5501463Z * [new branch] gh/wconstab/453/head -> origin/gh/wconstab/453/head 2025-12-04T08:56:59.5503013Z * [new branch] gh/wconstab/453/orig -> origin/gh/wconstab/453/orig 2025-12-04T08:56:59.5504971Z * [new branch] gh/wconstab/454/base -> origin/gh/wconstab/454/base 2025-12-04T08:56:59.5506482Z * [new branch] gh/wconstab/454/head -> origin/gh/wconstab/454/head 2025-12-04T08:56:59.5508048Z * [new branch] gh/wconstab/454/orig -> origin/gh/wconstab/454/orig 2025-12-04T08:56:59.5510149Z * [new branch] gh/wconstab/455/base -> origin/gh/wconstab/455/base 2025-12-04T08:56:59.5511701Z * [new branch] gh/wconstab/455/head -> origin/gh/wconstab/455/head 2025-12-04T08:56:59.5513244Z * [new branch] gh/wconstab/455/orig -> origin/gh/wconstab/455/orig 2025-12-04T08:56:59.5515524Z * [new branch] gh/wconstab/456/base -> origin/gh/wconstab/456/base 2025-12-04T08:56:59.5517308Z * [new branch] gh/wconstab/456/head -> origin/gh/wconstab/456/head 2025-12-04T08:56:59.5518939Z * [new branch] gh/wconstab/456/orig -> origin/gh/wconstab/456/orig 2025-12-04T08:56:59.5521178Z * [new branch] gh/wconstab/457/base -> origin/gh/wconstab/457/base 2025-12-04T08:56:59.5522746Z * [new branch] gh/wconstab/457/head -> origin/gh/wconstab/457/head 2025-12-04T08:56:59.5524404Z * [new branch] gh/wconstab/457/orig -> origin/gh/wconstab/457/orig 2025-12-04T08:56:59.5526457Z * [new branch] gh/wconstab/458/base -> origin/gh/wconstab/458/base 2025-12-04T08:56:59.5528070Z * [new branch] gh/wconstab/458/head -> origin/gh/wconstab/458/head 2025-12-04T08:56:59.5529616Z * [new branch] gh/wconstab/458/orig -> origin/gh/wconstab/458/orig 2025-12-04T08:56:59.5531630Z * [new branch] gh/wconstab/459/base -> origin/gh/wconstab/459/base 2025-12-04T08:56:59.5533354Z * [new branch] gh/wconstab/459/head -> origin/gh/wconstab/459/head 2025-12-04T08:56:59.5534836Z * [new branch] gh/wconstab/459/orig -> origin/gh/wconstab/459/orig 2025-12-04T08:56:59.5537468Z * [new branch] gh/wconstab/460/base -> origin/gh/wconstab/460/base 2025-12-04T08:56:59.5539221Z * [new branch] gh/wconstab/460/head -> origin/gh/wconstab/460/head 2025-12-04T08:56:59.5540815Z * [new branch] gh/wconstab/460/orig -> origin/gh/wconstab/460/orig 2025-12-04T08:56:59.5543140Z * [new branch] gh/wconstab/461/base -> origin/gh/wconstab/461/base 2025-12-04T08:56:59.5544679Z * [new branch] gh/wconstab/461/head -> origin/gh/wconstab/461/head 2025-12-04T08:56:59.5546262Z * [new branch] gh/wconstab/461/orig -> origin/gh/wconstab/461/orig 2025-12-04T08:56:59.5548227Z * [new branch] gh/wconstab/462/base -> origin/gh/wconstab/462/base 2025-12-04T08:56:59.5549905Z * [new branch] gh/wconstab/462/head -> origin/gh/wconstab/462/head 2025-12-04T08:56:59.5551620Z * [new branch] gh/wconstab/462/orig -> origin/gh/wconstab/462/orig 2025-12-04T08:56:59.5553847Z * [new branch] gh/wconstab/463/base -> origin/gh/wconstab/463/base 2025-12-04T08:56:59.5555383Z * [new branch] gh/wconstab/463/head -> origin/gh/wconstab/463/head 2025-12-04T08:56:59.5556920Z * [new branch] gh/wconstab/463/orig -> origin/gh/wconstab/463/orig 2025-12-04T08:56:59.5559079Z * [new branch] gh/wconstab/464/base -> origin/gh/wconstab/464/base 2025-12-04T08:56:59.5560690Z * [new branch] gh/wconstab/464/head -> origin/gh/wconstab/464/head 2025-12-04T08:56:59.5562517Z * [new branch] gh/wconstab/464/orig -> origin/gh/wconstab/464/orig 2025-12-04T08:56:59.5565174Z * [new branch] gh/wconstab/465/base -> origin/gh/wconstab/465/base 2025-12-04T08:56:59.5566706Z * [new branch] gh/wconstab/465/head -> origin/gh/wconstab/465/head 2025-12-04T08:56:59.5568270Z * [new branch] gh/wconstab/465/orig -> origin/gh/wconstab/465/orig 2025-12-04T08:56:59.5570484Z * [new branch] gh/wconstab/466/base -> origin/gh/wconstab/466/base 2025-12-04T08:56:59.5572018Z * [new branch] gh/wconstab/466/head -> origin/gh/wconstab/466/head 2025-12-04T08:56:59.5573627Z * [new branch] gh/wconstab/466/orig -> origin/gh/wconstab/466/orig 2025-12-04T08:56:59.5576089Z * [new branch] gh/wconstab/467/base -> origin/gh/wconstab/467/base 2025-12-04T08:56:59.5577584Z * [new branch] gh/wconstab/467/head -> origin/gh/wconstab/467/head 2025-12-04T08:56:59.5579639Z * [new branch] gh/wconstab/467/orig -> origin/gh/wconstab/467/orig 2025-12-04T08:56:59.5581722Z * [new branch] gh/wconstab/468/base -> origin/gh/wconstab/468/base 2025-12-04T08:56:59.5583285Z * [new branch] gh/wconstab/468/head -> origin/gh/wconstab/468/head 2025-12-04T08:56:59.5584761Z * [new branch] gh/wconstab/468/orig -> origin/gh/wconstab/468/orig 2025-12-04T08:56:59.5587375Z * [new branch] gh/weifengpy/39/base -> origin/gh/weifengpy/39/base 2025-12-04T08:56:59.5588849Z * [new branch] gh/weifengpy/39/head -> origin/gh/weifengpy/39/head 2025-12-04T08:56:59.5590436Z * [new branch] gh/weifengpy/39/orig -> origin/gh/weifengpy/39/orig 2025-12-04T08:56:59.5592608Z * [new branch] gh/weifengpy/40/base -> origin/gh/weifengpy/40/base 2025-12-04T08:56:59.5594107Z * [new branch] gh/weifengpy/40/head -> origin/gh/weifengpy/40/head 2025-12-04T08:56:59.5595636Z * [new branch] gh/weifengpy/40/orig -> origin/gh/weifengpy/40/orig 2025-12-04T08:56:59.5597902Z * [new branch] gh/weifengpy/41/base -> origin/gh/weifengpy/41/base 2025-12-04T08:56:59.5599443Z * [new branch] gh/weifengpy/41/head -> origin/gh/weifengpy/41/head 2025-12-04T08:56:59.5601098Z * [new branch] gh/weifengpy/41/orig -> origin/gh/weifengpy/41/orig 2025-12-04T08:56:59.5604162Z * [new branch] gh/williamwen42/250/base -> origin/gh/williamwen42/250/base 2025-12-04T08:56:59.5605752Z * [new branch] gh/williamwen42/250/head -> origin/gh/williamwen42/250/head 2025-12-04T08:56:59.5607435Z * [new branch] gh/williamwen42/250/orig -> origin/gh/williamwen42/250/orig 2025-12-04T08:56:59.5609556Z * [new branch] gh/williamwen42/279/base -> origin/gh/williamwen42/279/base 2025-12-04T08:56:59.5611276Z * [new branch] gh/williamwen42/279/head -> origin/gh/williamwen42/279/head 2025-12-04T08:56:59.5612996Z * [new branch] gh/williamwen42/279/orig -> origin/gh/williamwen42/279/orig 2025-12-04T08:56:59.5615109Z * [new branch] gh/williamwen42/282/base -> origin/gh/williamwen42/282/base 2025-12-04T08:56:59.5616623Z * [new branch] gh/williamwen42/282/head -> origin/gh/williamwen42/282/head 2025-12-04T08:56:59.5618146Z * [new branch] gh/williamwen42/282/orig -> origin/gh/williamwen42/282/orig 2025-12-04T08:56:59.5620340Z * [new branch] gh/williamwen42/287/base -> origin/gh/williamwen42/287/base 2025-12-04T08:56:59.5621916Z * [new branch] gh/williamwen42/287/head -> origin/gh/williamwen42/287/head 2025-12-04T08:56:59.5623427Z * [new branch] gh/williamwen42/287/orig -> origin/gh/williamwen42/287/orig 2025-12-04T08:56:59.5625518Z * [new branch] gh/williamwen42/288/base -> origin/gh/williamwen42/288/base 2025-12-04T08:56:59.5627168Z * [new branch] gh/williamwen42/288/head -> origin/gh/williamwen42/288/head 2025-12-04T08:56:59.5628717Z * [new branch] gh/williamwen42/288/orig -> origin/gh/williamwen42/288/orig 2025-12-04T08:56:59.5630946Z * [new branch] gh/williamwen42/296/base -> origin/gh/williamwen42/296/base 2025-12-04T08:56:59.5632628Z * [new branch] gh/williamwen42/296/head -> origin/gh/williamwen42/296/head 2025-12-04T08:56:59.5634162Z * [new branch] gh/williamwen42/296/orig -> origin/gh/williamwen42/296/orig 2025-12-04T08:56:59.5636151Z * [new branch] gh/williamwen42/297/base -> origin/gh/williamwen42/297/base 2025-12-04T08:56:59.5637711Z * [new branch] gh/williamwen42/297/head -> origin/gh/williamwen42/297/head 2025-12-04T08:56:59.5639260Z * [new branch] gh/williamwen42/297/orig -> origin/gh/williamwen42/297/orig 2025-12-04T08:56:59.5641387Z * [new branch] gh/williamwen42/306/base -> origin/gh/williamwen42/306/base 2025-12-04T08:56:59.5642989Z * [new branch] gh/williamwen42/306/head -> origin/gh/williamwen42/306/head 2025-12-04T08:56:59.5644600Z * [new branch] gh/williamwen42/306/orig -> origin/gh/williamwen42/306/orig 2025-12-04T08:56:59.5646759Z * [new branch] gh/williamwen42/309/base -> origin/gh/williamwen42/309/base 2025-12-04T08:56:59.5650242Z * [new branch] gh/williamwen42/309/head -> origin/gh/williamwen42/309/head 2025-12-04T08:56:59.5650586Z * [new branch] gh/williamwen42/309/orig -> origin/gh/williamwen42/309/orig 2025-12-04T08:56:59.5652800Z * [new branch] gh/williamwen42/310/base -> origin/gh/williamwen42/310/base 2025-12-04T08:56:59.5654428Z * [new branch] gh/williamwen42/310/head -> origin/gh/williamwen42/310/head 2025-12-04T08:56:59.5655940Z * [new branch] gh/williamwen42/310/orig -> origin/gh/williamwen42/310/orig 2025-12-04T08:56:59.5659579Z * [new branch] gh/williamwen42/311/base -> origin/gh/williamwen42/311/base 2025-12-04T08:56:59.5661122Z * [new branch] gh/williamwen42/311/head -> origin/gh/williamwen42/311/head 2025-12-04T08:56:59.5664353Z * [new branch] gh/williamwen42/311/orig -> origin/gh/williamwen42/311/orig 2025-12-04T08:56:59.5666491Z * [new branch] gh/williamwen42/319/base -> origin/gh/williamwen42/319/base 2025-12-04T08:56:59.5668225Z * [new branch] gh/williamwen42/319/head -> origin/gh/williamwen42/319/head 2025-12-04T08:56:59.5669744Z * [new branch] gh/williamwen42/319/orig -> origin/gh/williamwen42/319/orig 2025-12-04T08:56:59.5671872Z * [new branch] gh/williamwen42/325/base -> origin/gh/williamwen42/325/base 2025-12-04T08:56:59.5673408Z * [new branch] gh/williamwen42/325/head -> origin/gh/williamwen42/325/head 2025-12-04T08:56:59.5675046Z * [new branch] gh/williamwen42/325/orig -> origin/gh/williamwen42/325/orig 2025-12-04T08:56:59.5677100Z * [new branch] gh/williamwen42/326/base -> origin/gh/williamwen42/326/base 2025-12-04T08:56:59.5678871Z * [new branch] gh/williamwen42/326/head -> origin/gh/williamwen42/326/head 2025-12-04T08:56:59.5680871Z * [new branch] gh/williamwen42/326/orig -> origin/gh/williamwen42/326/orig 2025-12-04T08:56:59.5682987Z * [new branch] gh/williamwen42/327/base -> origin/gh/williamwen42/327/base 2025-12-04T08:56:59.5684540Z * [new branch] gh/williamwen42/327/head -> origin/gh/williamwen42/327/head 2025-12-04T08:56:59.5686054Z * [new branch] gh/williamwen42/327/orig -> origin/gh/williamwen42/327/orig 2025-12-04T08:56:59.5688232Z * [new branch] gh/williamwen42/328/base -> origin/gh/williamwen42/328/base 2025-12-04T08:56:59.5689788Z * [new branch] gh/williamwen42/328/head -> origin/gh/williamwen42/328/head 2025-12-04T08:56:59.5691513Z * [new branch] gh/williamwen42/328/orig -> origin/gh/williamwen42/328/orig 2025-12-04T08:56:59.5694157Z * [new branch] gh/williamwen42/329/base -> origin/gh/williamwen42/329/base 2025-12-04T08:56:59.5695715Z * [new branch] gh/williamwen42/329/head -> origin/gh/williamwen42/329/head 2025-12-04T08:56:59.5697308Z * [new branch] gh/williamwen42/329/orig -> origin/gh/williamwen42/329/orig 2025-12-04T08:56:59.5699550Z * [new branch] gh/williamwen42/330/base -> origin/gh/williamwen42/330/base 2025-12-04T08:56:59.5701108Z * [new branch] gh/williamwen42/330/head -> origin/gh/williamwen42/330/head 2025-12-04T08:56:59.5702702Z * [new branch] gh/williamwen42/330/orig -> origin/gh/williamwen42/330/orig 2025-12-04T08:56:59.5704836Z * [new branch] gh/williamwen42/331/base -> origin/gh/williamwen42/331/base 2025-12-04T08:56:59.5706415Z * [new branch] gh/williamwen42/331/head -> origin/gh/williamwen42/331/head 2025-12-04T08:56:59.5707933Z * [new branch] gh/williamwen42/331/orig -> origin/gh/williamwen42/331/orig 2025-12-04T08:56:59.5710608Z * [new branch] gh/williamwen42/332/base -> origin/gh/williamwen42/332/base 2025-12-04T08:56:59.5712204Z * [new branch] gh/williamwen42/332/head -> origin/gh/williamwen42/332/head 2025-12-04T08:56:59.5713773Z * [new branch] gh/williamwen42/332/orig -> origin/gh/williamwen42/332/orig 2025-12-04T08:56:59.5716162Z * [new branch] gh/williamwen42/333/base -> origin/gh/williamwen42/333/base 2025-12-04T08:56:59.5717680Z * [new branch] gh/williamwen42/333/head -> origin/gh/williamwen42/333/head 2025-12-04T08:56:59.5719385Z * [new branch] gh/williamwen42/333/orig -> origin/gh/williamwen42/333/orig 2025-12-04T08:56:59.5721575Z * [new branch] gh/williamwen42/334/base -> origin/gh/williamwen42/334/base 2025-12-04T08:56:59.5723089Z * [new branch] gh/williamwen42/334/head -> origin/gh/williamwen42/334/head 2025-12-04T08:56:59.5724646Z * [new branch] gh/williamwen42/334/orig -> origin/gh/williamwen42/334/orig 2025-12-04T08:56:59.5727380Z * [new branch] gh/williamwen42/335/base -> origin/gh/williamwen42/335/base 2025-12-04T08:56:59.5732109Z * [new branch] gh/williamwen42/335/head -> origin/gh/williamwen42/335/head 2025-12-04T08:56:59.5733993Z * [new branch] gh/williamwen42/335/orig -> origin/gh/williamwen42/335/orig 2025-12-04T08:56:59.5736130Z * [new branch] gh/williamwen42/336/base -> origin/gh/williamwen42/336/base 2025-12-04T08:56:59.5737624Z * [new branch] gh/williamwen42/336/head -> origin/gh/williamwen42/336/head 2025-12-04T08:56:59.5739130Z * [new branch] gh/williamwen42/336/orig -> origin/gh/williamwen42/336/orig 2025-12-04T08:56:59.5741284Z * [new branch] gh/williamwen42/337/base -> origin/gh/williamwen42/337/base 2025-12-04T08:56:59.5742819Z * [new branch] gh/williamwen42/337/head -> origin/gh/williamwen42/337/head 2025-12-04T08:56:59.5744442Z * [new branch] gh/williamwen42/337/orig -> origin/gh/williamwen42/337/orig 2025-12-04T08:56:59.5746700Z * [new branch] gh/williamwen42/338/base -> origin/gh/williamwen42/338/base 2025-12-04T08:56:59.5748243Z * [new branch] gh/williamwen42/338/head -> origin/gh/williamwen42/338/head 2025-12-04T08:56:59.5749777Z * [new branch] gh/williamwen42/338/orig -> origin/gh/williamwen42/338/orig 2025-12-04T08:56:59.5751903Z * [new branch] gh/williamwen42/339/base -> origin/gh/williamwen42/339/base 2025-12-04T08:56:59.5753459Z * [new branch] gh/williamwen42/339/head -> origin/gh/williamwen42/339/head 2025-12-04T08:56:59.5754985Z * [new branch] gh/williamwen42/339/orig -> origin/gh/williamwen42/339/orig 2025-12-04T08:56:59.5757328Z * [new branch] gh/williamwen42/340/base -> origin/gh/williamwen42/340/base 2025-12-04T08:56:59.5758710Z * [new branch] gh/williamwen42/340/head -> origin/gh/williamwen42/340/head 2025-12-04T08:56:59.5760185Z * [new branch] gh/williamwen42/340/orig -> origin/gh/williamwen42/340/orig 2025-12-04T08:56:59.5762431Z * [new branch] gh/williamwen42/341/base -> origin/gh/williamwen42/341/base 2025-12-04T08:56:59.5764606Z * [new branch] gh/williamwen42/341/head -> origin/gh/williamwen42/341/head 2025-12-04T08:56:59.5766252Z * [new branch] gh/williamwen42/341/orig -> origin/gh/williamwen42/341/orig 2025-12-04T08:56:59.5768388Z * [new branch] gh/williamwen42/342/base -> origin/gh/williamwen42/342/base 2025-12-04T08:56:59.5769952Z * [new branch] gh/williamwen42/342/head -> origin/gh/williamwen42/342/head 2025-12-04T08:56:59.5771475Z * [new branch] gh/williamwen42/342/orig -> origin/gh/williamwen42/342/orig 2025-12-04T08:56:59.5773909Z * [new branch] gh/williamwen42/343/base -> origin/gh/williamwen42/343/base 2025-12-04T08:56:59.5775394Z * [new branch] gh/williamwen42/343/head -> origin/gh/williamwen42/343/head 2025-12-04T08:56:59.5777020Z * [new branch] gh/williamwen42/343/orig -> origin/gh/williamwen42/343/orig 2025-12-04T08:56:59.5779193Z * [new branch] gh/williamwen42/344/base -> origin/gh/williamwen42/344/base 2025-12-04T08:56:59.5780749Z * [new branch] gh/williamwen42/344/head -> origin/gh/williamwen42/344/head 2025-12-04T08:56:59.5782259Z * [new branch] gh/williamwen42/344/orig -> origin/gh/williamwen42/344/orig 2025-12-04T08:56:59.5784576Z * [new branch] gh/williamwen42/345/base -> origin/gh/williamwen42/345/base 2025-12-04T08:56:59.5786163Z * [new branch] gh/williamwen42/345/head -> origin/gh/williamwen42/345/head 2025-12-04T08:56:59.5787862Z * [new branch] gh/williamwen42/345/orig -> origin/gh/williamwen42/345/orig 2025-12-04T08:56:59.5789963Z * [new branch] gh/williamwen42/346/base -> origin/gh/williamwen42/346/base 2025-12-04T08:56:59.5791540Z * [new branch] gh/williamwen42/346/head -> origin/gh/williamwen42/346/head 2025-12-04T08:56:59.5793043Z * [new branch] gh/williamwen42/346/orig -> origin/gh/williamwen42/346/orig 2025-12-04T08:56:59.5795363Z * [new branch] gh/williamwen42/347/base -> origin/gh/williamwen42/347/base 2025-12-04T08:56:59.5796818Z * [new branch] gh/williamwen42/347/head -> origin/gh/williamwen42/347/head 2025-12-04T08:56:59.5798353Z * [new branch] gh/williamwen42/347/orig -> origin/gh/williamwen42/347/orig 2025-12-04T08:56:59.5800460Z * [new branch] gh/williamwen42/348/base -> origin/gh/williamwen42/348/base 2025-12-04T08:56:59.5801971Z * [new branch] gh/williamwen42/348/head -> origin/gh/williamwen42/348/head 2025-12-04T08:56:59.5803457Z * [new branch] gh/williamwen42/348/orig -> origin/gh/williamwen42/348/orig 2025-12-04T08:56:59.5805472Z * [new branch] gh/williamwen42/349/base -> origin/gh/williamwen42/349/base 2025-12-04T08:56:59.5807031Z * [new branch] gh/williamwen42/349/head -> origin/gh/williamwen42/349/head 2025-12-04T08:56:59.5808688Z * [new branch] gh/williamwen42/349/orig -> origin/gh/williamwen42/349/orig 2025-12-04T08:56:59.5810860Z * [new branch] gh/williamwen42/350/base -> origin/gh/williamwen42/350/base 2025-12-04T08:56:59.5812480Z * [new branch] gh/williamwen42/350/head -> origin/gh/williamwen42/350/head 2025-12-04T08:56:59.5814076Z * [new branch] gh/williamwen42/350/orig -> origin/gh/williamwen42/350/orig 2025-12-04T08:56:59.5816511Z * [new branch] gh/williamwen42/351/base -> origin/gh/williamwen42/351/base 2025-12-04T08:56:59.5817909Z * [new branch] gh/williamwen42/351/head -> origin/gh/williamwen42/351/head 2025-12-04T08:56:59.5819506Z * [new branch] gh/williamwen42/351/orig -> origin/gh/williamwen42/351/orig 2025-12-04T08:56:59.5821700Z * [new branch] gh/williamwen42/352/base -> origin/gh/williamwen42/352/base 2025-12-04T08:56:59.5823293Z * [new branch] gh/williamwen42/352/head -> origin/gh/williamwen42/352/head 2025-12-04T08:56:59.5824812Z * [new branch] gh/williamwen42/352/orig -> origin/gh/williamwen42/352/orig 2025-12-04T08:56:59.5827036Z * [new branch] gh/williamwen42/353/base -> origin/gh/williamwen42/353/base 2025-12-04T08:56:59.5828630Z * [new branch] gh/williamwen42/353/head -> origin/gh/williamwen42/353/head 2025-12-04T08:56:59.5830339Z * [new branch] gh/williamwen42/353/orig -> origin/gh/williamwen42/353/orig 2025-12-04T08:56:59.5832428Z * [new branch] gh/williamwen42/354/base -> origin/gh/williamwen42/354/base 2025-12-04T08:56:59.5834105Z * [new branch] gh/williamwen42/354/head -> origin/gh/williamwen42/354/head 2025-12-04T08:56:59.5835636Z * [new branch] gh/williamwen42/354/orig -> origin/gh/williamwen42/354/orig 2025-12-04T08:56:59.5837758Z * [new branch] gh/williamwen42/355/base -> origin/gh/williamwen42/355/base 2025-12-04T08:56:59.5839269Z * [new branch] gh/williamwen42/355/head -> origin/gh/williamwen42/355/head 2025-12-04T08:56:59.5840779Z * [new branch] gh/williamwen42/355/orig -> origin/gh/williamwen42/355/orig 2025-12-04T08:56:59.5842928Z * [new branch] gh/williamwen42/356/base -> origin/gh/williamwen42/356/base 2025-12-04T08:56:59.5844494Z * [new branch] gh/williamwen42/356/head -> origin/gh/williamwen42/356/head 2025-12-04T08:56:59.5846483Z * [new branch] gh/williamwen42/356/orig -> origin/gh/williamwen42/356/orig 2025-12-04T08:56:59.5848614Z * [new branch] gh/williamwen42/357/base -> origin/gh/williamwen42/357/base 2025-12-04T08:56:59.5850239Z * [new branch] gh/williamwen42/357/head -> origin/gh/williamwen42/357/head 2025-12-04T08:56:59.5851812Z * [new branch] gh/williamwen42/357/orig -> origin/gh/williamwen42/357/orig 2025-12-04T08:56:59.5854177Z * [new branch] gh/williamwen42/358/base -> origin/gh/williamwen42/358/base 2025-12-04T08:56:59.5855825Z * [new branch] gh/williamwen42/358/head -> origin/gh/williamwen42/358/head 2025-12-04T08:56:59.5857482Z * [new branch] gh/williamwen42/358/orig -> origin/gh/williamwen42/358/orig 2025-12-04T08:56:59.5860041Z * [new branch] gh/xmfan/169/base -> origin/gh/xmfan/169/base 2025-12-04T08:56:59.5861582Z * [new branch] gh/xmfan/169/head -> origin/gh/xmfan/169/head 2025-12-04T08:56:59.5863883Z * [new branch] gh/xmfan/170/base -> origin/gh/xmfan/170/base 2025-12-04T08:56:59.5865365Z * [new branch] gh/xmfan/170/head -> origin/gh/xmfan/170/head 2025-12-04T08:56:59.5867393Z * [new branch] gh/xmfan/274/base -> origin/gh/xmfan/274/base 2025-12-04T08:56:59.5868899Z * [new branch] gh/xmfan/274/head -> origin/gh/xmfan/274/head 2025-12-04T08:56:59.5870419Z * [new branch] gh/xmfan/274/orig -> origin/gh/xmfan/274/orig 2025-12-04T08:56:59.5872601Z * [new branch] gh/xmfan/277/base -> origin/gh/xmfan/277/base 2025-12-04T08:56:59.5874076Z * [new branch] gh/xmfan/277/head -> origin/gh/xmfan/277/head 2025-12-04T08:56:59.5875578Z * [new branch] gh/xmfan/277/orig -> origin/gh/xmfan/277/orig 2025-12-04T08:56:59.5877603Z * [new branch] gh/xmfan/301/base -> origin/gh/xmfan/301/base 2025-12-04T08:56:59.5879277Z * [new branch] gh/xmfan/301/head -> origin/gh/xmfan/301/head 2025-12-04T08:56:59.5880679Z * [new branch] gh/xmfan/301/orig -> origin/gh/xmfan/301/orig 2025-12-04T08:56:59.5882666Z * [new branch] gh/xmfan/304/base -> origin/gh/xmfan/304/base 2025-12-04T08:56:59.5884296Z * [new branch] gh/xmfan/304/head -> origin/gh/xmfan/304/head 2025-12-04T08:56:59.5885892Z * [new branch] gh/xmfan/304/orig -> origin/gh/xmfan/304/orig 2025-12-04T08:56:59.5887932Z * [new branch] gh/xmfan/309/base -> origin/gh/xmfan/309/base 2025-12-04T08:56:59.5889426Z * [new branch] gh/xmfan/309/head -> origin/gh/xmfan/309/head 2025-12-04T08:56:59.5891424Z * [new branch] gh/xmfan/309/orig -> origin/gh/xmfan/309/orig 2025-12-04T08:56:59.5893787Z * [new branch] gh/xmfan/310/base -> origin/gh/xmfan/310/base 2025-12-04T08:56:59.5895270Z * [new branch] gh/xmfan/310/head -> origin/gh/xmfan/310/head 2025-12-04T08:56:59.5896777Z * [new branch] gh/xmfan/310/orig -> origin/gh/xmfan/310/orig 2025-12-04T08:56:59.5899269Z * [new branch] gh/xmfan/311/base -> origin/gh/xmfan/311/base 2025-12-04T08:56:59.5900804Z * [new branch] gh/xmfan/311/head -> origin/gh/xmfan/311/head 2025-12-04T08:56:59.5902362Z * [new branch] gh/xmfan/311/orig -> origin/gh/xmfan/311/orig 2025-12-04T08:56:59.5904456Z * [new branch] gh/xmfan/312/base -> origin/gh/xmfan/312/base 2025-12-04T08:56:59.5906029Z * [new branch] gh/xmfan/312/head -> origin/gh/xmfan/312/head 2025-12-04T08:56:59.5907584Z * [new branch] gh/xmfan/312/orig -> origin/gh/xmfan/312/orig 2025-12-04T08:56:59.5909642Z * [new branch] gh/xmfan/313/base -> origin/gh/xmfan/313/base 2025-12-04T08:56:59.5911178Z * [new branch] gh/xmfan/313/head -> origin/gh/xmfan/313/head 2025-12-04T08:56:59.5912696Z * [new branch] gh/xmfan/313/orig -> origin/gh/xmfan/313/orig 2025-12-04T08:56:59.5915309Z * [new branch] gh/xuanzhang816/27/base -> origin/gh/xuanzhang816/27/base 2025-12-04T08:56:59.5916816Z * [new branch] gh/xuanzhang816/27/head -> origin/gh/xuanzhang816/27/head 2025-12-04T08:56:59.5918342Z * [new branch] gh/xuanzhang816/27/orig -> origin/gh/xuanzhang816/27/orig 2025-12-04T08:56:59.5920506Z * [new branch] gh/xuanzhang816/32/base -> origin/gh/xuanzhang816/32/base 2025-12-04T08:56:59.5922012Z * [new branch] gh/xuanzhang816/32/head -> origin/gh/xuanzhang816/32/head 2025-12-04T08:56:59.5923994Z * [new branch] gh/xuanzhang816/32/orig -> origin/gh/xuanzhang816/32/orig 2025-12-04T08:56:59.5926040Z * [new branch] gh/xuanzhang816/33/base -> origin/gh/xuanzhang816/33/base 2025-12-04T08:56:59.5927638Z * [new branch] gh/xuanzhang816/33/head -> origin/gh/xuanzhang816/33/head 2025-12-04T08:56:59.5929227Z * [new branch] gh/xuanzhang816/33/orig -> origin/gh/xuanzhang816/33/orig 2025-12-04T08:56:59.5931577Z * [new branch] gh/xuanzhang816/34/base -> origin/gh/xuanzhang816/34/base 2025-12-04T08:56:59.5933284Z * [new branch] gh/xuanzhang816/34/head -> origin/gh/xuanzhang816/34/head 2025-12-04T08:56:59.5934889Z * [new branch] gh/xuanzhang816/34/orig -> origin/gh/xuanzhang816/34/orig 2025-12-04T08:56:59.5937235Z * [new branch] gh/xuanzhang816/35/base -> origin/gh/xuanzhang816/35/base 2025-12-04T08:56:59.5938725Z * [new branch] gh/xuanzhang816/35/head -> origin/gh/xuanzhang816/35/head 2025-12-04T08:56:59.5940282Z * [new branch] gh/xuanzhang816/35/orig -> origin/gh/xuanzhang816/35/orig 2025-12-04T08:56:59.5942944Z * [new branch] gh/yanbing-j/11/base -> origin/gh/yanbing-j/11/base 2025-12-04T08:56:59.5944466Z * [new branch] gh/yanbing-j/11/head -> origin/gh/yanbing-j/11/head 2025-12-04T08:56:59.5946017Z * [new branch] gh/yanbing-j/11/orig -> origin/gh/yanbing-j/11/orig 2025-12-04T08:56:59.5948133Z * [new branch] gh/yanbing-j/12/base -> origin/gh/yanbing-j/12/base 2025-12-04T08:56:59.5949712Z * [new branch] gh/yanbing-j/12/head -> origin/gh/yanbing-j/12/head 2025-12-04T08:56:59.5951188Z * [new branch] gh/yanbing-j/12/orig -> origin/gh/yanbing-j/12/orig 2025-12-04T08:56:59.5953305Z * [new branch] gh/yanbing-j/13/base -> origin/gh/yanbing-j/13/base 2025-12-04T08:56:59.5954907Z * [new branch] gh/yanbing-j/13/head -> origin/gh/yanbing-j/13/head 2025-12-04T08:56:59.5956477Z * [new branch] gh/yanbing-j/13/orig -> origin/gh/yanbing-j/13/orig 2025-12-04T08:56:59.5958918Z * [new branch] gh/yanbing-j/14/base -> origin/gh/yanbing-j/14/base 2025-12-04T08:56:59.5960478Z * [new branch] gh/yanbing-j/14/head -> origin/gh/yanbing-j/14/head 2025-12-04T08:56:59.5962231Z * [new branch] gh/yanbing-j/14/orig -> origin/gh/yanbing-j/14/orig 2025-12-04T08:56:59.5964265Z * [new branch] gh/yanbing-j/15/base -> origin/gh/yanbing-j/15/base 2025-12-04T08:56:59.5965819Z * [new branch] gh/yanbing-j/15/head -> origin/gh/yanbing-j/15/head 2025-12-04T08:56:59.5967353Z * [new branch] gh/yanbing-j/15/orig -> origin/gh/yanbing-j/15/orig 2025-12-04T08:56:59.5969366Z * [new branch] gh/yanbing-j/18/base -> origin/gh/yanbing-j/18/base 2025-12-04T08:56:59.5970958Z * [new branch] gh/yanbing-j/18/head -> origin/gh/yanbing-j/18/head 2025-12-04T08:56:59.5972486Z * [new branch] gh/yanbing-j/18/orig -> origin/gh/yanbing-j/18/orig 2025-12-04T08:56:59.5974705Z * [new branch] gh/yanbing-j/19/base -> origin/gh/yanbing-j/19/base 2025-12-04T08:56:59.5976311Z * [new branch] gh/yanbing-j/19/head -> origin/gh/yanbing-j/19/head 2025-12-04T08:56:59.5977878Z * [new branch] gh/yanbing-j/19/orig -> origin/gh/yanbing-j/19/orig 2025-12-04T08:56:59.5979870Z * [new branch] gh/yanbing-j/20/base -> origin/gh/yanbing-j/20/base 2025-12-04T08:56:59.5981412Z * [new branch] gh/yanbing-j/20/head -> origin/gh/yanbing-j/20/head 2025-12-04T08:56:59.5982906Z * [new branch] gh/yanbing-j/20/orig -> origin/gh/yanbing-j/20/orig 2025-12-04T08:56:59.5985053Z * [new branch] gh/yanbing-j/21/base -> origin/gh/yanbing-j/21/base 2025-12-04T08:56:59.5986585Z * [new branch] gh/yanbing-j/21/head -> origin/gh/yanbing-j/21/head 2025-12-04T08:56:59.5988759Z * [new branch] gh/yanbing-j/22/base -> origin/gh/yanbing-j/22/base 2025-12-04T08:56:59.5990283Z * [new branch] gh/yanbing-j/22/head -> origin/gh/yanbing-j/22/head 2025-12-04T08:56:59.5991813Z * [new branch] gh/yanbing-j/22/orig -> origin/gh/yanbing-j/22/orig 2025-12-04T08:56:59.5994016Z * [new branch] gh/yanbing-j/23/base -> origin/gh/yanbing-j/23/base 2025-12-04T08:56:59.5995553Z * [new branch] gh/yanbing-j/23/head -> origin/gh/yanbing-j/23/head 2025-12-04T08:56:59.5997179Z * [new branch] gh/yanbing-j/23/orig -> origin/gh/yanbing-j/23/orig 2025-12-04T08:56:59.5999251Z * [new branch] gh/yanbing-j/24/base -> origin/gh/yanbing-j/24/base 2025-12-04T08:56:59.6000791Z * [new branch] gh/yanbing-j/24/head -> origin/gh/yanbing-j/24/head 2025-12-04T08:56:59.6002344Z * [new branch] gh/yanbing-j/24/orig -> origin/gh/yanbing-j/24/orig 2025-12-04T08:56:59.6004649Z * [new branch] gh/yanbing-j/25/base -> origin/gh/yanbing-j/25/base 2025-12-04T08:56:59.6006176Z * [new branch] gh/yanbing-j/25/head -> origin/gh/yanbing-j/25/head 2025-12-04T08:56:59.6007636Z * [new branch] gh/yanbing-j/25/orig -> origin/gh/yanbing-j/25/orig 2025-12-04T08:56:59.6009681Z * [new branch] gh/yanbing-j/26/base -> origin/gh/yanbing-j/26/base 2025-12-04T08:56:59.6011223Z * [new branch] gh/yanbing-j/26/head -> origin/gh/yanbing-j/26/head 2025-12-04T08:56:59.6012806Z * [new branch] gh/yanbing-j/26/orig -> origin/gh/yanbing-j/26/orig 2025-12-04T08:56:59.6015507Z * [new branch] gh/yang-yu-hang/1/base -> origin/gh/yang-yu-hang/1/base 2025-12-04T08:56:59.6017197Z * [new branch] gh/yang-yu-hang/1/head -> origin/gh/yang-yu-hang/1/head 2025-12-04T08:56:59.6018908Z * [new branch] gh/yang-yu-hang/1/orig -> origin/gh/yang-yu-hang/1/orig 2025-12-04T08:56:59.6021369Z * [new branch] gh/yang-yu-hang/2/base -> origin/gh/yang-yu-hang/2/base 2025-12-04T08:56:59.6023237Z * [new branch] gh/yang-yu-hang/2/head -> origin/gh/yang-yu-hang/2/head 2025-12-04T08:56:59.6024995Z * [new branch] gh/yang-yu-hang/2/orig -> origin/gh/yang-yu-hang/2/orig 2025-12-04T08:56:59.6027044Z * [new branch] gh/yang-yu-hang/3/base -> origin/gh/yang-yu-hang/3/base 2025-12-04T08:56:59.6028604Z * [new branch] gh/yang-yu-hang/3/head -> origin/gh/yang-yu-hang/3/head 2025-12-04T08:56:59.6030276Z * [new branch] gh/yang-yu-hang/3/orig -> origin/gh/yang-yu-hang/3/orig 2025-12-04T08:56:59.6032712Z * [new branch] gh/yangw-dev/12/base -> origin/gh/yangw-dev/12/base 2025-12-04T08:56:59.6034262Z * [new branch] gh/yangw-dev/12/head -> origin/gh/yangw-dev/12/head 2025-12-04T08:56:59.6035816Z * [new branch] gh/yangw-dev/12/orig -> origin/gh/yangw-dev/12/orig 2025-12-04T08:56:59.6038063Z * [new branch] gh/yangw-dev/13/base -> origin/gh/yangw-dev/13/base 2025-12-04T08:56:59.6039650Z * [new branch] gh/yangw-dev/13/head -> origin/gh/yangw-dev/13/head 2025-12-04T08:56:59.6041286Z * [new branch] gh/yangw-dev/13/orig -> origin/gh/yangw-dev/13/orig 2025-12-04T08:56:59.6043313Z * [new branch] gh/yangw-dev/14/base -> origin/gh/yangw-dev/14/base 2025-12-04T08:56:59.6044866Z * [new branch] gh/yangw-dev/14/head -> origin/gh/yangw-dev/14/head 2025-12-04T08:56:59.6046426Z * [new branch] gh/yangw-dev/14/orig -> origin/gh/yangw-dev/14/orig 2025-12-04T08:56:59.6048548Z * [new branch] gh/yangw-dev/15/base -> origin/gh/yangw-dev/15/base 2025-12-04T08:56:59.6050027Z * [new branch] gh/yangw-dev/15/head -> origin/gh/yangw-dev/15/head 2025-12-04T08:56:59.6051534Z * [new branch] gh/yangw-dev/15/orig -> origin/gh/yangw-dev/15/orig 2025-12-04T08:56:59.6053765Z * [new branch] gh/yangw-dev/19/base -> origin/gh/yangw-dev/19/base 2025-12-04T08:56:59.6055265Z * [new branch] gh/yangw-dev/19/head -> origin/gh/yangw-dev/19/head 2025-12-04T08:56:59.6056786Z * [new branch] gh/yangw-dev/19/orig -> origin/gh/yangw-dev/19/orig 2025-12-04T08:56:59.6058927Z * [new branch] gh/yangw-dev/26/base -> origin/gh/yangw-dev/26/base 2025-12-04T08:56:59.6060592Z * [new branch] gh/yangw-dev/26/head -> origin/gh/yangw-dev/26/head 2025-12-04T08:56:59.6062466Z * [new branch] gh/yangw-dev/26/orig -> origin/gh/yangw-dev/26/orig 2025-12-04T08:56:59.6066048Z * [new branch] gh/yangw-dev/27/base -> origin/gh/yangw-dev/27/base 2025-12-04T08:56:59.6067581Z * [new branch] gh/yangw-dev/27/head -> origin/gh/yangw-dev/27/head 2025-12-04T08:56:59.6069283Z * [new branch] gh/yangw-dev/27/orig -> origin/gh/yangw-dev/27/orig 2025-12-04T08:56:59.6071717Z * [new branch] gh/ydwu4/292/base -> origin/gh/ydwu4/292/base 2025-12-04T08:56:59.6073144Z * [new branch] gh/ydwu4/292/head -> origin/gh/ydwu4/292/head 2025-12-04T08:56:59.6074649Z * [new branch] gh/ydwu4/292/orig -> origin/gh/ydwu4/292/orig 2025-12-04T08:56:59.6076803Z * [new branch] gh/ydwu4/294/base -> origin/gh/ydwu4/294/base 2025-12-04T08:56:59.6078332Z * [new branch] gh/ydwu4/294/head -> origin/gh/ydwu4/294/head 2025-12-04T08:56:59.6079858Z * [new branch] gh/ydwu4/294/orig -> origin/gh/ydwu4/294/orig 2025-12-04T08:56:59.6082145Z * [new branch] gh/ydwu4/295/base -> origin/gh/ydwu4/295/base 2025-12-04T08:56:59.6083743Z * [new branch] gh/ydwu4/295/head -> origin/gh/ydwu4/295/head 2025-12-04T08:56:59.6085299Z * [new branch] gh/ydwu4/295/orig -> origin/gh/ydwu4/295/orig 2025-12-04T08:56:59.6087298Z * [new branch] gh/ydwu4/296/base -> origin/gh/ydwu4/296/base 2025-12-04T08:56:59.6088818Z * [new branch] gh/ydwu4/296/head -> origin/gh/ydwu4/296/head 2025-12-04T08:56:59.6090464Z * [new branch] gh/ydwu4/296/orig -> origin/gh/ydwu4/296/orig 2025-12-04T08:56:59.6092673Z * [new branch] gh/ydwu4/306/base -> origin/gh/ydwu4/306/base 2025-12-04T08:56:59.6094334Z * [new branch] gh/ydwu4/306/head -> origin/gh/ydwu4/306/head 2025-12-04T08:56:59.6095871Z * [new branch] gh/ydwu4/306/orig -> origin/gh/ydwu4/306/orig 2025-12-04T08:56:59.6097979Z * [new branch] gh/ydwu4/312/base -> origin/gh/ydwu4/312/base 2025-12-04T08:56:59.6099520Z * [new branch] gh/ydwu4/312/head -> origin/gh/ydwu4/312/head 2025-12-04T08:56:59.6101067Z * [new branch] gh/ydwu4/312/orig -> origin/gh/ydwu4/312/orig 2025-12-04T08:56:59.6103268Z * [new branch] gh/ydwu4/322/base -> origin/gh/ydwu4/322/base 2025-12-04T08:56:59.6104807Z * [new branch] gh/ydwu4/322/head -> origin/gh/ydwu4/322/head 2025-12-04T08:56:59.6106350Z * [new branch] gh/ydwu4/322/orig -> origin/gh/ydwu4/322/orig 2025-12-04T08:56:59.6108371Z * [new branch] gh/ydwu4/327/base -> origin/gh/ydwu4/327/base 2025-12-04T08:56:59.6109946Z * [new branch] gh/ydwu4/327/head -> origin/gh/ydwu4/327/head 2025-12-04T08:56:59.6111463Z * [new branch] gh/ydwu4/327/orig -> origin/gh/ydwu4/327/orig 2025-12-04T08:56:59.6113619Z * [new branch] gh/ydwu4/328/base -> origin/gh/ydwu4/328/base 2025-12-04T08:56:59.6115110Z * [new branch] gh/ydwu4/328/head -> origin/gh/ydwu4/328/head 2025-12-04T08:56:59.6116640Z * [new branch] gh/ydwu4/328/orig -> origin/gh/ydwu4/328/orig 2025-12-04T08:56:59.6118568Z * [new branch] gh/ydwu4/329/base -> origin/gh/ydwu4/329/base 2025-12-04T08:56:59.6120102Z * [new branch] gh/ydwu4/329/head -> origin/gh/ydwu4/329/head 2025-12-04T08:56:59.6121616Z * [new branch] gh/ydwu4/329/orig -> origin/gh/ydwu4/329/orig 2025-12-04T08:56:59.6123840Z * [new branch] gh/ydwu4/330/base -> origin/gh/ydwu4/330/base 2025-12-04T08:56:59.6125312Z * [new branch] gh/ydwu4/330/head -> origin/gh/ydwu4/330/head 2025-12-04T08:56:59.6126817Z * [new branch] gh/ydwu4/330/orig -> origin/gh/ydwu4/330/orig 2025-12-04T08:56:59.6128837Z * [new branch] gh/ydwu4/331/base -> origin/gh/ydwu4/331/base 2025-12-04T08:56:59.6130355Z * [new branch] gh/ydwu4/331/head -> origin/gh/ydwu4/331/head 2025-12-04T08:56:59.6132006Z * [new branch] gh/ydwu4/331/orig -> origin/gh/ydwu4/331/orig 2025-12-04T08:56:59.6134039Z * [new branch] gh/ydwu4/332/base -> origin/gh/ydwu4/332/base 2025-12-04T08:56:59.6135509Z * [new branch] gh/ydwu4/332/head -> origin/gh/ydwu4/332/head 2025-12-04T08:56:59.6137044Z * [new branch] gh/ydwu4/332/orig -> origin/gh/ydwu4/332/orig 2025-12-04T08:56:59.6139012Z * [new branch] gh/ydwu4/333/base -> origin/gh/ydwu4/333/base 2025-12-04T08:56:59.6140666Z * [new branch] gh/ydwu4/333/head -> origin/gh/ydwu4/333/head 2025-12-04T08:56:59.6142223Z * [new branch] gh/ydwu4/333/orig -> origin/gh/ydwu4/333/orig 2025-12-04T08:56:59.6144167Z * [new branch] gh/ydwu4/334/base -> origin/gh/ydwu4/334/base 2025-12-04T08:56:59.6145765Z * [new branch] gh/ydwu4/334/head -> origin/gh/ydwu4/334/head 2025-12-04T08:56:59.6147249Z * [new branch] gh/ydwu4/334/orig -> origin/gh/ydwu4/334/orig 2025-12-04T08:56:59.6149185Z * [new branch] gh/ydwu4/335/base -> origin/gh/ydwu4/335/base 2025-12-04T08:56:59.6151125Z * [new branch] gh/ydwu4/335/head -> origin/gh/ydwu4/335/head 2025-12-04T08:56:59.6152706Z * [new branch] gh/ydwu4/335/orig -> origin/gh/ydwu4/335/orig 2025-12-04T08:56:59.6155699Z * [new branch] gh/ydwu4/337/base -> origin/gh/ydwu4/337/base 2025-12-04T08:56:59.6157224Z * [new branch] gh/ydwu4/337/head -> origin/gh/ydwu4/337/head 2025-12-04T08:56:59.6158748Z * [new branch] gh/ydwu4/337/orig -> origin/gh/ydwu4/337/orig 2025-12-04T08:56:59.6160886Z * [new branch] gh/ydwu4/339/base -> origin/gh/ydwu4/339/base 2025-12-04T08:56:59.6162696Z * [new branch] gh/ydwu4/339/head -> origin/gh/ydwu4/339/head 2025-12-04T08:56:59.6164656Z * [new branch] gh/ydwu4/339/orig -> origin/gh/ydwu4/339/orig 2025-12-04T08:56:59.6167331Z * [new branch] gh/yf225/133/base -> origin/gh/yf225/133/base 2025-12-04T08:56:59.6168956Z * [new branch] gh/yf225/133/head -> origin/gh/yf225/133/head 2025-12-04T08:56:59.6171008Z * [new branch] gh/yf225/93/base -> origin/gh/yf225/93/base 2025-12-04T08:56:59.6172633Z * [new branch] gh/yf225/93/head -> origin/gh/yf225/93/head 2025-12-04T08:56:59.6175578Z * [new branch] gh/yifuwang/152/base -> origin/gh/yifuwang/152/base 2025-12-04T08:56:59.6177462Z * [new branch] gh/yifuwang/152/head -> origin/gh/yifuwang/152/head 2025-12-04T08:56:59.6179020Z * [new branch] gh/yifuwang/152/orig -> origin/gh/yifuwang/152/orig 2025-12-04T08:56:59.6182015Z * [new branch] gh/yifuwang/195/base -> origin/gh/yifuwang/195/base 2025-12-04T08:56:59.6183614Z * [new branch] gh/yifuwang/195/head -> origin/gh/yifuwang/195/head 2025-12-04T08:56:59.6185152Z * [new branch] gh/yifuwang/195/orig -> origin/gh/yifuwang/195/orig 2025-12-04T08:56:59.6187826Z * [new branch] gh/yiming0416/1/base -> origin/gh/yiming0416/1/base 2025-12-04T08:56:59.6189374Z * [new branch] gh/yiming0416/1/head -> origin/gh/yiming0416/1/head 2025-12-04T08:56:59.6191367Z * [new branch] gh/yiming0416/2/base -> origin/gh/yiming0416/2/base 2025-12-04T08:56:59.6192956Z * [new branch] gh/yiming0416/2/head -> origin/gh/yiming0416/2/head 2025-12-04T08:56:59.6195563Z * [new branch] gh/yushangdi/1/base -> origin/gh/yushangdi/1/base 2025-12-04T08:56:59.6197152Z * [new branch] gh/yushangdi/1/head -> origin/gh/yushangdi/1/head 2025-12-04T08:56:59.6199399Z * [new branch] gh/yushangdi/10/base -> origin/gh/yushangdi/10/base 2025-12-04T08:56:59.6200851Z * [new branch] gh/yushangdi/10/head -> origin/gh/yushangdi/10/head 2025-12-04T08:56:59.6202338Z * [new branch] gh/yushangdi/10/orig -> origin/gh/yushangdi/10/orig 2025-12-04T08:56:59.6204409Z * [new branch] gh/yushangdi/11/base -> origin/gh/yushangdi/11/base 2025-12-04T08:56:59.6205902Z * [new branch] gh/yushangdi/11/head -> origin/gh/yushangdi/11/head 2025-12-04T08:56:59.6207469Z * [new branch] gh/yushangdi/11/orig -> origin/gh/yushangdi/11/orig 2025-12-04T08:56:59.6209409Z * [new branch] gh/yushangdi/2/base -> origin/gh/yushangdi/2/base 2025-12-04T08:56:59.6210899Z * [new branch] gh/yushangdi/2/head -> origin/gh/yushangdi/2/head 2025-12-04T08:56:59.6213175Z * [new branch] gh/yushangdi/7/base -> origin/gh/yushangdi/7/base 2025-12-04T08:56:59.6214698Z * [new branch] gh/yushangdi/7/head -> origin/gh/yushangdi/7/head 2025-12-04T08:56:59.6216226Z * [new branch] gh/yushangdi/7/orig -> origin/gh/yushangdi/7/orig 2025-12-04T08:56:59.6218567Z * [new branch] gh/yushangdi/8/base -> origin/gh/yushangdi/8/base 2025-12-04T08:56:59.6220253Z * [new branch] gh/yushangdi/8/head -> origin/gh/yushangdi/8/head 2025-12-04T08:56:59.6221853Z * [new branch] gh/yushangdi/8/orig -> origin/gh/yushangdi/8/orig 2025-12-04T08:56:59.6223722Z * [new branch] gh/yushangdi/9/base -> origin/gh/yushangdi/9/base 2025-12-04T08:56:59.6225352Z * [new branch] gh/yushangdi/9/head -> origin/gh/yushangdi/9/head 2025-12-04T08:56:59.6227075Z * [new branch] gh/yushangdi/9/orig -> origin/gh/yushangdi/9/orig 2025-12-04T08:56:59.6229596Z * [new branch] gh/zklaus/19/base -> origin/gh/zklaus/19/base 2025-12-04T08:56:59.6231567Z * [new branch] gh/zklaus/19/head -> origin/gh/zklaus/19/head 2025-12-04T08:56:59.6233131Z * [new branch] gh/zklaus/19/orig -> origin/gh/zklaus/19/orig 2025-12-04T08:56:59.6235211Z * [new branch] gh/zklaus/20/base -> origin/gh/zklaus/20/base 2025-12-04T08:56:59.6236759Z * [new branch] gh/zklaus/20/head -> origin/gh/zklaus/20/head 2025-12-04T08:56:59.6238282Z * [new branch] gh/zklaus/20/orig -> origin/gh/zklaus/20/orig 2025-12-04T08:56:59.6240378Z * [new branch] gh/zklaus/21/base -> origin/gh/zklaus/21/base 2025-12-04T08:56:59.6241937Z * [new branch] gh/zklaus/21/head -> origin/gh/zklaus/21/head 2025-12-04T08:56:59.6243493Z * [new branch] gh/zklaus/21/orig -> origin/gh/zklaus/21/orig 2025-12-04T08:56:59.6245659Z * [new branch] gh/zklaus/22/base -> origin/gh/zklaus/22/base 2025-12-04T08:56:59.6247185Z * [new branch] gh/zklaus/22/head -> origin/gh/zklaus/22/head 2025-12-04T08:56:59.6248749Z * [new branch] gh/zklaus/22/orig -> origin/gh/zklaus/22/orig 2025-12-04T08:56:59.6251389Z * [new branch] gh/zklaus/23/base -> origin/gh/zklaus/23/base 2025-12-04T08:56:59.6253130Z * [new branch] gh/zklaus/23/head -> origin/gh/zklaus/23/head 2025-12-04T08:56:59.6254707Z * [new branch] gh/zklaus/23/orig -> origin/gh/zklaus/23/orig 2025-12-04T08:56:59.6256755Z * [new branch] gh/zklaus/24/base -> origin/gh/zklaus/24/base 2025-12-04T08:56:59.6258293Z * [new branch] gh/zklaus/24/head -> origin/gh/zklaus/24/head 2025-12-04T08:56:59.6259825Z * [new branch] gh/zklaus/24/orig -> origin/gh/zklaus/24/orig 2025-12-04T08:56:59.6262435Z * [new branch] gh/zou3519/1197/base -> origin/gh/zou3519/1197/base 2025-12-04T08:56:59.6264260Z * [new branch] gh/zou3519/1197/head -> origin/gh/zou3519/1197/head 2025-12-04T08:56:59.6265690Z * [new branch] gh/zou3519/1197/orig -> origin/gh/zou3519/1197/orig 2025-12-04T08:56:59.6268038Z * [new branch] gh/zou3519/1199/base -> origin/gh/zou3519/1199/base 2025-12-04T08:56:59.6269725Z * [new branch] gh/zou3519/1199/head -> origin/gh/zou3519/1199/head 2025-12-04T08:56:59.6271318Z * [new branch] gh/zou3519/1199/orig -> origin/gh/zou3519/1199/orig 2025-12-04T08:56:59.6273352Z * [new branch] gh/zou3519/1200/base -> origin/gh/zou3519/1200/base 2025-12-04T08:56:59.6274873Z * [new branch] gh/zou3519/1200/head -> origin/gh/zou3519/1200/head 2025-12-04T08:56:59.6276430Z * [new branch] gh/zou3519/1200/orig -> origin/gh/zou3519/1200/orig 2025-12-04T08:56:59.6278592Z * [new branch] gh/zou3519/1201/base -> origin/gh/zou3519/1201/base 2025-12-04T08:56:59.6280024Z * [new branch] gh/zou3519/1201/head -> origin/gh/zou3519/1201/head 2025-12-04T08:56:59.6281575Z * [new branch] gh/zou3519/1201/orig -> origin/gh/zou3519/1201/orig 2025-12-04T08:56:59.6283526Z * [new branch] gh/zou3519/1202/base -> origin/gh/zou3519/1202/base 2025-12-04T08:56:59.6285093Z * [new branch] gh/zou3519/1202/head -> origin/gh/zou3519/1202/head 2025-12-04T08:56:59.6286606Z * [new branch] gh/zou3519/1202/orig -> origin/gh/zou3519/1202/orig 2025-12-04T08:56:59.6289238Z * [new branch] gh/zpcore/1/base -> origin/gh/zpcore/1/base 2025-12-04T08:56:59.6290780Z * [new branch] gh/zpcore/1/head -> origin/gh/zpcore/1/head 2025-12-04T08:56:59.6293156Z * [new branch] gh/zpcore/11/base -> origin/gh/zpcore/11/base 2025-12-04T08:56:59.6294768Z * [new branch] gh/zpcore/11/head -> origin/gh/zpcore/11/head 2025-12-04T08:56:59.6296362Z * [new branch] gh/zpcore/11/orig -> origin/gh/zpcore/11/orig 2025-12-04T08:56:59.6298763Z * [new branch] gh/zpcore/12/base -> origin/gh/zpcore/12/base 2025-12-04T08:56:59.6300300Z * [new branch] gh/zpcore/12/head -> origin/gh/zpcore/12/head 2025-12-04T08:56:59.6301851Z * [new branch] gh/zpcore/12/orig -> origin/gh/zpcore/12/orig 2025-12-04T08:56:59.6303964Z * [new branch] gh/zpcore/13/base -> origin/gh/zpcore/13/base 2025-12-04T08:56:59.6305442Z * [new branch] gh/zpcore/13/head -> origin/gh/zpcore/13/head 2025-12-04T08:56:59.6306936Z * [new branch] gh/zpcore/13/orig -> origin/gh/zpcore/13/orig 2025-12-04T08:56:59.6309120Z * [new branch] gh/zpcore/14/base -> origin/gh/zpcore/14/base 2025-12-04T08:56:59.6310719Z * [new branch] gh/zpcore/14/head -> origin/gh/zpcore/14/head 2025-12-04T08:56:59.6312249Z * [new branch] gh/zpcore/14/orig -> origin/gh/zpcore/14/orig 2025-12-04T08:56:59.6314493Z * [new branch] gh/zpcore/15/base -> origin/gh/zpcore/15/base 2025-12-04T08:56:59.6315977Z * [new branch] gh/zpcore/15/head -> origin/gh/zpcore/15/head 2025-12-04T08:56:59.6317527Z * [new branch] gh/zpcore/15/orig -> origin/gh/zpcore/15/orig 2025-12-04T08:56:59.6319609Z * [new branch] gh/zpcore/2/base -> origin/gh/zpcore/2/base 2025-12-04T08:56:59.6321166Z * [new branch] gh/zpcore/2/head -> origin/gh/zpcore/2/head 2025-12-04T08:56:59.6323646Z * [new branch] gh/zpcore/21/base -> origin/gh/zpcore/21/base 2025-12-04T08:56:59.6325312Z * [new branch] gh/zpcore/21/head -> origin/gh/zpcore/21/head 2025-12-04T08:56:59.6326870Z * [new branch] gh/zpcore/21/orig -> origin/gh/zpcore/21/orig 2025-12-04T08:56:59.6329276Z * [new branch] gh/zpcore/22/base -> origin/gh/zpcore/22/base 2025-12-04T08:56:59.6330781Z * [new branch] gh/zpcore/22/head -> origin/gh/zpcore/22/head 2025-12-04T08:56:59.6332401Z * [new branch] gh/zpcore/22/orig -> origin/gh/zpcore/22/orig 2025-12-04T08:56:59.6334636Z * [new branch] gh/zpcore/23/base -> origin/gh/zpcore/23/base 2025-12-04T08:56:59.6336157Z * [new branch] gh/zpcore/23/head -> origin/gh/zpcore/23/head 2025-12-04T08:56:59.6337815Z * [new branch] gh/zpcore/23/orig -> origin/gh/zpcore/23/orig 2025-12-04T08:56:59.6339765Z * [new branch] gh/zpcore/24/base -> origin/gh/zpcore/24/base 2025-12-04T08:56:59.6341351Z * [new branch] gh/zpcore/24/head -> origin/gh/zpcore/24/head 2025-12-04T08:56:59.6342936Z * [new branch] gh/zpcore/24/orig -> origin/gh/zpcore/24/orig 2025-12-04T08:56:59.6345155Z * [new branch] gh/zpcore/25/base -> origin/gh/zpcore/25/base 2025-12-04T08:56:59.6346652Z * [new branch] gh/zpcore/25/head -> origin/gh/zpcore/25/head 2025-12-04T08:56:59.6348225Z * [new branch] gh/zpcore/25/orig -> origin/gh/zpcore/25/orig 2025-12-04T08:56:59.6350484Z * [new branch] gh/zpcore/26/base -> origin/gh/zpcore/26/base 2025-12-04T08:56:59.6352095Z * [new branch] gh/zpcore/26/head -> origin/gh/zpcore/26/head 2025-12-04T08:56:59.6353662Z * [new branch] gh/zpcore/26/orig -> origin/gh/zpcore/26/orig 2025-12-04T08:56:59.6355780Z * [new branch] gh/zpcore/27/base -> origin/gh/zpcore/27/base 2025-12-04T08:56:59.6357428Z * [new branch] gh/zpcore/27/head -> origin/gh/zpcore/27/head 2025-12-04T08:56:59.6358953Z * [new branch] gh/zpcore/27/orig -> origin/gh/zpcore/27/orig 2025-12-04T08:56:59.6361385Z * [new branch] gh/zpcore/28/base -> origin/gh/zpcore/28/base 2025-12-04T08:56:59.6363494Z * [new branch] gh/zpcore/28/head -> origin/gh/zpcore/28/head 2025-12-04T08:56:59.6365059Z * [new branch] gh/zpcore/28/orig -> origin/gh/zpcore/28/orig 2025-12-04T08:56:59.6367437Z * [new branch] gh/zpcore/3/base -> origin/gh/zpcore/3/base 2025-12-04T08:56:59.6368949Z * [new branch] gh/zpcore/3/head -> origin/gh/zpcore/3/head 2025-12-04T08:56:59.6370914Z * [new branch] gh/zpcore/4/base -> origin/gh/zpcore/4/base 2025-12-04T08:56:59.6372696Z * [new branch] gh/zpcore/4/head -> origin/gh/zpcore/4/head 2025-12-04T08:56:59.6374681Z * [new branch] gh/zpcore/5/base -> origin/gh/zpcore/5/base 2025-12-04T08:56:59.6376201Z * [new branch] gh/zpcore/5/head -> origin/gh/zpcore/5/head 2025-12-04T08:56:59.6378166Z * [new branch] gh/zpcore/6/base -> origin/gh/zpcore/6/base 2025-12-04T08:56:59.6379633Z * [new branch] gh/zpcore/6/head -> origin/gh/zpcore/6/head 2025-12-04T08:56:59.6382198Z * [new branch] gh/zpcore/7/base -> origin/gh/zpcore/7/base 2025-12-04T08:56:59.6383612Z * [new branch] gh/zpcore/7/head -> origin/gh/zpcore/7/head 2025-12-04T08:56:59.6385724Z * [new branch] gh/zpcore/8/base -> origin/gh/zpcore/8/base 2025-12-04T08:56:59.6387204Z * [new branch] gh/zpcore/8/head -> origin/gh/zpcore/8/head 2025-12-04T08:56:59.6389017Z * [new branch] google-main -> origin/google-main 2025-12-04T08:56:59.6391123Z * [new branch] guangyey/external_stream -> origin/guangyey/external_stream 2025-12-04T08:56:59.6392765Z * [new branch] guangyey/test_2025 -> origin/guangyey/test_2025 2025-12-04T08:56:59.6395121Z * [new branch] guilhermeleobas/cherry-pick-55d87d9dfd9 -> origin/guilhermeleobas/cherry-pick-55d87d9dfd9 2025-12-04T08:56:59.6397013Z * [new branch] hameerabbasi/complex_tensor_subclass -> origin/hameerabbasi/complex_tensor_subclass 2025-12-04T08:56:59.6398594Z * [new branch] hameerabbasi/fix-ctensor-gradcheck-tests -> origin/hameerabbasi/fix-ctensor-gradcheck-tests 2025-12-04T08:56:59.6400039Z * [new branch] hameerabbasi/gradcheck-allclose -> origin/hameerabbasi/gradcheck-allclose 2025-12-04T08:56:59.6401581Z * [new branch] hc_baseline -> origin/hc_baseline 2025-12-04T08:56:59.6403199Z * [new branch] hhh_rand -> origin/hhh_rand 2025-12-04T08:56:59.6405263Z * [new branch] huba/f1 -> origin/huba/f1 2025-12-04T08:56:59.6407426Z * [new branch] increase-timeout-linux-jammy-cuda12_8-py3_10-gcc11-test -> origin/increase-timeout-linux-jammy-cuda12_8-py3_10-gcc11-test 2025-12-04T08:56:59.6408762Z * [new branch] inlining -> origin/inlining 2025-12-04T08:56:59.6410388Z * [new branch] inlining-ezyang -> origin/inlining-ezyang 2025-12-04T08:56:59.6412010Z * [new branch] install-torchao-0.13.0 -> origin/install-torchao-0.13.0 2025-12-04T08:56:59.6414237Z * [new branch] instrument-trunk-pull-linux-with-job-test-filters -> origin/instrument-trunk-pull-linux-with-job-test-filters 2025-12-04T08:56:59.6415620Z * [new branch] invoke-subgraph -> origin/invoke-subgraph 2025-12-04T08:56:59.6417382Z * [new branch] issue#58739 -> origin/issue#58739 2025-12-04T08:56:59.6418965Z * [new branch] jainapurva-patch-1 -> origin/jainapurva-patch-1 2025-12-04T08:56:59.6420959Z * [new branch] jathu/o3 -> origin/jathu/o3 2025-12-04T08:56:59.6422418Z * [new branch] jathu/sve -> origin/jathu/sve 2025-12-04T08:56:59.6424719Z * [new branch] jcaip/test-cusparselt-version-0.6.2 -> origin/jcaip/test-cusparselt-version-0.6.2 2025-12-04T08:56:59.6426179Z * [new branch] jcaip/update-cusparselt-0.6.2 -> origin/jcaip/update-cusparselt-0.6.2 2025-12-04T08:56:59.6428717Z * [new branch] jiannanWang/memorysnapshot_filter -> origin/jiannanWang/memorysnapshot_filter 2025-12-04T08:56:59.6430247Z * [new branch] jiannanWang/profilerstepwarning -> origin/jiannanWang/profilerstepwarning 2025-12-04T08:56:59.6431907Z * [new branch] jithunnair-amd-patch-1 -> origin/jithunnair-amd-patch-1 2025-12-04T08:56:59.6433548Z * [new branch] jithunnair-amd-patch-10 -> origin/jithunnair-amd-patch-10 2025-12-04T08:56:59.6435196Z * [new branch] jithunnair-amd-patch-2 -> origin/jithunnair-amd-patch-2 2025-12-04T08:56:59.6436919Z * [new branch] jithunnair-amd-patch-3 -> origin/jithunnair-amd-patch-3 2025-12-04T08:56:59.6438649Z * [new branch] jithunnair-amd-patch-4 -> origin/jithunnair-amd-patch-4 2025-12-04T08:56:59.6440321Z * [new branch] jithunnair-amd-patch-5 -> origin/jithunnair-amd-patch-5 2025-12-04T08:56:59.6441981Z * [new branch] jithunnair-amd-patch-6 -> origin/jithunnair-amd-patch-6 2025-12-04T08:56:59.6443570Z * [new branch] jithunnair-amd-patch-7 -> origin/jithunnair-amd-patch-7 2025-12-04T08:56:59.6445282Z * [new branch] jithunnair-amd-patch-8 -> origin/jithunnair-amd-patch-8 2025-12-04T08:56:59.6446965Z * [new branch] jithunnair-amd-patch-9 -> origin/jithunnair-amd-patch-9 2025-12-04T08:56:59.6449139Z * [new branch] justinchu/native-qdq -> origin/justinchu/native-qdq 2025-12-04T08:56:59.6451671Z * [new branch] kainan666/xlf_debug -> origin/kainan666/xlf_debug 2025-12-04T08:56:59.6453523Z * [new branch] kainan_test -> origin/kainan_test 2025-12-04T08:56:59.6455108Z * [new branch] larryliu0820-patch-1 -> origin/larryliu0820-patch-1 2025-12-04T08:56:59.6457236Z * [new branch] leslie/test_group_gemm_epilogues -> origin/leslie/test_group_gemm_epilogues 2025-12-04T08:56:59.6459339Z * [new branch] lessw2020/fix_cutlass_cache_error -> origin/lessw2020/fix_cutlass_cache_error 2025-12-04T08:56:59.6461445Z * [new branch] liaoxuan/shm_all_reduce -> origin/liaoxuan/shm_all_reduce 2025-12-04T08:56:59.6464680Z * [new branch] liaoxuan/test_fa_disable_softmax -> origin/liaoxuan/test_fa_disable_softmax 2025-12-04T08:56:59.6466167Z * [new branch] liaoxuan/test_int8_sdpa -> origin/liaoxuan/test_int8_sdpa 2025-12-04T08:56:59.6467673Z * [new branch] llama4-stable -> origin/llama4-stable 2025-12-04T08:56:59.6470326Z * [new branch] lts/release/1.8 -> origin/lts/release/1.8 2025-12-04T08:56:59.6472436Z * [new branch] lucaskabela/#94773 -> origin/lucaskabela/#94773 2025-12-04T08:56:59.6473864Z * [new branch] lucaskabela/fix_164876 -> origin/lucaskabela/fix_164876 2025-12-04T08:56:59.6475320Z * [new branch] lucaskabela/flop_counter -> origin/lucaskabela/flop_counter 2025-12-04T08:56:59.6476822Z * [new branch] lucaskabela/func_under_decomp -> origin/lucaskabela/func_under_decomp 2025-12-04T08:56:59.6478318Z * [new branch] lucaskabela/functional_in_dynamo -> origin/lucaskabela/functional_in_dynamo 2025-12-04T08:56:59.6479940Z * [new branch] lucaskabela/install_params_as_graph_attr -> origin/lucaskabela/install_params_as_graph_attr 2025-12-04T08:56:59.6481729Z * [new branch] lucaskabela/parameters_as_graph_attr -> origin/lucaskabela/parameters_as_graph_attr 2025-12-04T08:56:59.6483778Z * [new branch] lucaskabela/remove_aot_dispatcher_metadata -> origin/lucaskabela/remove_aot_dispatcher_metadata 2025-12-04T08:56:59.6485290Z * [new branch] lucaskabela/rnn_decomp -> origin/lucaskabela/rnn_decomp 2025-12-04T08:56:59.6486878Z * [new branch] lucaskabela/typing_backends -> origin/lucaskabela/typing_backends 2025-12-04T08:56:59.6488491Z * [new branch] lucaskabela/typing_ctx_manager -> origin/lucaskabela/typing_ctx_manager 2025-12-04T08:56:59.6490066Z * [new branch] lucaskabela/typing_nn_module -> origin/lucaskabela/typing_nn_module 2025-12-04T08:56:59.6491637Z * [new branch] lucaskabela/typing_user_defined -> origin/lucaskabela/typing_user_defined 2025-12-04T08:56:59.6493455Z * [new branch] lucaskabela/typing_variables -> origin/lucaskabela/typing_variables 2025-12-04T08:56:59.6495009Z * [new branch] lucaskabela/typing_variables_dicts -> origin/lucaskabela/typing_variables_dicts 2025-12-04T08:56:59.6496670Z * [new branch] lucaskabela/typing_variables_functions -> origin/lucaskabela/typing_variables_functions 2025-12-04T08:56:59.6498125Z * [new branch] lucaskabela/typing_variables_lists -> origin/lucaskabela/typing_variables_lists 2025-12-04T08:56:59.6500251Z * [new branch] lw/torch_box_by_ref -> origin/lw/torch_box_by_ref 2025-12-04T08:56:59.6501927Z * [new branch] main -> origin/main 2025-12-04T08:56:59.6503689Z * [new branch] malfet-patch-1 -> origin/malfet-patch-1 2025-12-04T08:56:59.6505367Z * [new branch] malfet-patch-2 -> origin/malfet-patch-2 2025-12-04T08:56:59.6507035Z * [new branch] malfet-patch-3 -> origin/malfet-patch-3 2025-12-04T08:56:59.6508814Z * [new branch] malfet-patch-4 -> origin/malfet-patch-4 2025-12-04T08:56:59.6510520Z * [new branch] malfet-patch-5 -> origin/malfet-patch-5 2025-12-04T08:56:59.6512249Z * [new branch] malfet-patch-6 -> origin/malfet-patch-6 2025-12-04T08:56:59.6513859Z * [new branch] malfet-patch-7 -> origin/malfet-patch-7 2025-12-04T08:56:59.6515527Z * [new branch] malfet-patch-8 -> origin/malfet-patch-8 2025-12-04T08:56:59.6517703Z * [new branch] malfet/add-3.14-ci -> origin/malfet/add-3.14-ci 2025-12-04T08:56:59.6519360Z * [new branch] malfet/be-do-not-make-typos-in-build-artifacts -> origin/malfet/be-do-not-make-typos-in-build-artifacts 2025-12-04T08:56:59.6520877Z * [new branch] malfet/be-move-more-settings-to-checkout-pytorch -> origin/malfet/be-move-more-settings-to-checkout-pytorch 2025-12-04T08:56:59.6522581Z * [new branch] malfet/be-remove-misisng-neon-headers -> origin/malfet/be-remove-misisng-neon-headers 2025-12-04T08:56:59.6524324Z * [new branch] malfet/mps-implement-col2im -> origin/malfet/mps-implement-col2im 2025-12-04T08:56:59.6526510Z * [new branch] manuel/aoti_metal_shimify-thread_safe -> origin/manuel/aoti_metal_shimify-thread_safe 2025-12-04T08:56:59.6527925Z * [new branch] manuel/inductor_link_openmp -> origin/manuel/inductor_link_openmp 2025-12-04T08:56:59.6529966Z * [new branch] masnesral/metaconda -> origin/masnesral/metaconda 2025-12-04T08:56:59.6531696Z * [new branch] mem_profiler_flaky_fix -> origin/mem_profiler_flaky_fix 2025-12-04T08:56:59.6533551Z * [new branch] mem_profiler_stack_trace -> origin/mem_profiler_stack_trace 2025-12-04T08:56:59.6535146Z * [new branch] memory_profiler_stack -> origin/memory_profiler_stack 2025-12-04T08:56:59.6536775Z * [new branch] metascroy-patch-1 -> origin/metascroy-patch-1 2025-12-04T08:56:59.6538372Z * [new branch] mingw_posix -> origin/mingw_posix 2025-12-04T08:56:59.6540508Z * [new branch] mlazos/S429861-debug -> origin/mlazos/S429861-debug 2025-12-04T08:56:59.6541966Z * [new branch] mlazos/aa -> origin/mlazos/aa 2025-12-04T08:56:59.6543480Z * [new branch] mlazos/acts -> origin/mlazos/acts 2025-12-04T08:56:59.6545014Z * [new branch] mlazos/arg-renames -> origin/mlazos/arg-renames 2025-12-04T08:56:59.6546526Z * [new branch] mlazos/bad-cudagraphs -> origin/mlazos/bad-cudagraphs 2025-12-04T08:56:59.6547985Z * [new branch] mlazos/baseline-graph-breaks -> origin/mlazos/baseline-graph-breaks 2025-12-04T08:56:59.6549409Z * [new branch] mlazos/beta-tensor -> origin/mlazos/beta-tensor 2025-12-04T08:56:59.6550857Z * [new branch] mlazos/buffers -> origin/mlazos/buffers 2025-12-04T08:56:59.6552243Z * [new branch] mlazos/buffers2 -> origin/mlazos/buffers2 2025-12-04T08:56:59.6553984Z * [new branch] mlazos/buffers3 -> origin/mlazos/buffers3 2025-12-04T08:56:59.6555870Z * [new branch] mlazos/bwd -> origin/mlazos/bwd 2025-12-04T08:56:59.6557377Z * [new branch] mlazos/combo-test -> origin/mlazos/combo-test 2025-12-04T08:56:59.6558974Z * [new branch] mlazos/ctx-cleanup -> origin/mlazos/ctx-cleanup 2025-12-04T08:56:59.6560526Z * [new branch] mlazos/cuda-cmd-log -> origin/mlazos/cuda-cmd-log 2025-12-04T08:56:59.6562257Z * [new branch] mlazos/cudagraph-tests -> origin/mlazos/cudagraph-tests 2025-12-04T08:56:59.6564077Z * [new branch] mlazos/cudagraphs-measurement -> origin/mlazos/cudagraphs-measurement 2025-12-04T08:56:59.6565637Z * [new branch] mlazos/cutlass-test -> origin/mlazos/cutlass-test 2025-12-04T08:56:59.6567191Z * [new branch] mlazos/cutlass-topo-bug -> origin/mlazos/cutlass-topo-bug 2025-12-04T08:56:59.6568921Z * [new branch] mlazos/dataclass-proxy -> origin/mlazos/dataclass-proxy 2025-12-04T08:56:59.6570357Z * [new branch] mlazos/dc-attrs -> origin/mlazos/dc-attrs 2025-12-04T08:56:59.6571883Z * [new branch] mlazos/dc-helion -> origin/mlazos/dc-helion 2025-12-04T08:56:59.6573608Z * [new branch] mlazos/dict-fix -> origin/mlazos/dict-fix 2025-12-04T08:56:59.6575200Z * [new branch] mlazos/disable-tf -> origin/mlazos/disable-tf 2025-12-04T08:56:59.6576704Z * [new branch] mlazos/dupe-fix -> origin/mlazos/dupe-fix 2025-12-04T08:56:59.6578312Z * [new branch] mlazos/dyn-batch -> origin/mlazos/dyn-batch 2025-12-04T08:56:59.6579870Z * [new branch] mlazos/evt -> origin/mlazos/evt 2025-12-04T08:56:59.6582099Z * [new branch] mlazos/extract-examples -> origin/mlazos/extract-examples 2025-12-04T08:56:59.6583650Z * [new branch] mlazos/foreach-op -> origin/mlazos/foreach-op 2025-12-04T08:56:59.6585165Z * [new branch] mlazos/fp8 -> origin/mlazos/fp8 2025-12-04T08:56:59.6586759Z * [new branch] mlazos/fp8-bias -> origin/mlazos/fp8-bias 2025-12-04T08:56:59.6588818Z * [new branch] mlazos/fp8-bias-fusion -> origin/mlazos/fp8-bias-fusion 2025-12-04T08:56:59.6590458Z * [new branch] mlazos/fp8-fixes -> origin/mlazos/fp8-fixes 2025-12-04T08:56:59.6592157Z * [new branch] mlazos/freezing -> origin/mlazos/freezing 2025-12-04T08:56:59.6593738Z * [new branch] mlazos/h-comp -> origin/mlazos/h-comp 2025-12-04T08:56:59.6595422Z * [new branch] mlazos/h-comp2 -> origin/mlazos/h-comp2 2025-12-04T08:56:59.6597005Z * [new branch] mlazos/hash-hop -> origin/mlazos/hash-hop 2025-12-04T08:56:59.6598616Z * [new branch] mlazos/hc -> origin/mlazos/hc 2025-12-04T08:56:59.6600217Z * [new branch] mlazos/hc-cycles -> origin/mlazos/hc-cycles 2025-12-04T08:56:59.6601783Z * [new branch] mlazos/hc-fixes -> origin/mlazos/hc-fixes 2025-12-04T08:56:59.6603328Z * [new branch] mlazos/hc-fixes3 -> origin/mlazos/hc-fixes3 2025-12-04T08:56:59.6604958Z * [new branch] mlazos/hc-fixes4 -> origin/mlazos/hc-fixes4 2025-12-04T08:56:59.6606497Z * [new branch] mlazos/hc-hf -> origin/mlazos/hc-hf 2025-12-04T08:56:59.6608127Z * [new branch] mlazos/hc-mut -> origin/mlazos/hc-mut 2025-12-04T08:56:59.6609672Z * [new branch] mlazos/hc10 -> origin/mlazos/hc10 2025-12-04T08:56:59.6611226Z * [new branch] mlazos/hc11 -> origin/mlazos/hc11 2025-12-04T08:56:59.6612924Z * [new branch] mlazos/hc12 -> origin/mlazos/hc12 2025-12-04T08:56:59.6614437Z * [new branch] mlazos/hc13 -> origin/mlazos/hc13 2025-12-04T08:56:59.6616027Z * [new branch] mlazos/hc14 -> origin/mlazos/hc14 2025-12-04T08:56:59.6617572Z * [new branch] mlazos/hc15 -> origin/mlazos/hc15 2025-12-04T08:56:59.6619115Z * [new branch] mlazos/hc2 -> origin/mlazos/hc2 2025-12-04T08:56:59.6620678Z * [new branch] mlazos/hc4 -> origin/mlazos/hc4 2025-12-04T08:56:59.6622221Z * [new branch] mlazos/hc5 -> origin/mlazos/hc5 2025-12-04T08:56:59.6623738Z * [new branch] mlazos/hc6 -> origin/mlazos/hc6 2025-12-04T08:56:59.6625333Z * [new branch] mlazos/hc7 -> origin/mlazos/hc7 2025-12-04T08:56:59.6626837Z * [new branch] mlazos/hc8 -> origin/mlazos/hc8 2025-12-04T08:56:59.6628485Z * [new branch] mlazos/hc9 -> origin/mlazos/hc9 2025-12-04T08:56:59.6630022Z * [new branch] mlazos/hc_baseline2 -> origin/mlazos/hc_baseline2 2025-12-04T08:56:59.6631471Z * [new branch] mlazos/inductor-streams -> origin/mlazos/inductor-streams 2025-12-04T08:56:59.6633001Z * [new branch] mlazos/main -> origin/mlazos/main 2025-12-04T08:56:59.6634562Z * [new branch] mlazos/mcg2 -> origin/mlazos/mcg2 2025-12-04T08:56:59.6636170Z * [new branch] mlazos/meta-guards -> origin/mlazos/meta-guards 2025-12-04T08:56:59.6638420Z * [new branch] mlazos/mlazos/foreach-map-adam -> origin/mlazos/mlazos/foreach-map-adam 2025-12-04T08:56:59.6640002Z * [new branch] mlazos/mlazos/tf-mode-backup -> origin/mlazos/mlazos/tf-mode-backup 2025-12-04T08:56:59.6641523Z * [new branch] mlazos/mod-fix -> origin/mlazos/mod-fix 2025-12-04T08:56:59.6643130Z * [new branch] mlazos/mode-fix -> origin/mlazos/mode-fix 2025-12-04T08:56:59.6644694Z * [new branch] mlazos/offsets -> origin/mlazos/offsets 2025-12-04T08:56:59.6646176Z * [new branch] mlazos/overguarding -> origin/mlazos/overguarding 2025-12-04T08:56:59.6647817Z * [new branch] mlazos/proxy-ctors -> origin/mlazos/proxy-ctors 2025-12-04T08:56:59.6649385Z * [new branch] mlazos/quant-fix -> origin/mlazos/quant-fix 2025-12-04T08:56:59.6650956Z * [new branch] mlazos/resnet-fix -> origin/mlazos/resnet-fix 2025-12-04T08:56:59.6652625Z * [new branch] mlazos/rm-buf-names -> origin/mlazos/rm-buf-names 2025-12-04T08:56:59.6654286Z * [new branch] mlazos/rm-code -> origin/mlazos/rm-code 2025-12-04T08:56:59.6655841Z * [new branch] mlazos/rm-spam -> origin/mlazos/rm-spam 2025-12-04T08:56:59.6657513Z * [new branch] mlazos/rtp -> origin/mlazos/rtp 2025-12-04T08:56:59.6659214Z * [new branch] mlazos/static-idx-dbg -> origin/mlazos/static-idx-dbg 2025-12-04T08:56:59.6660820Z * [new branch] mlazos/static-inputs-log -> origin/mlazos/static-inputs-log 2025-12-04T08:56:59.6662360Z * [new branch] mlazos/stests -> origin/mlazos/stests 2025-12-04T08:56:59.6664016Z * [new branch] mlazos/stream-ops -> origin/mlazos/stream-ops 2025-12-04T08:56:59.6665565Z * [new branch] mlazos/td-fix2 -> origin/mlazos/td-fix2 2025-12-04T08:56:59.6667148Z * [new branch] mlazos/tensor-hasattr2 -> origin/mlazos/tensor-hasattr2 2025-12-04T08:56:59.6668676Z * [new branch] mlazos/test -> origin/mlazos/test 2025-12-04T08:56:59.6670248Z * [new branch] mlazos/tf-mode -> origin/mlazos/tf-mode 2025-12-04T08:56:59.6671864Z * [new branch] mlazos/tf-mode-backup2 -> origin/mlazos/tf-mode-backup2 2025-12-04T08:56:59.6673488Z * [new branch] mlazos/tf-mode-reland -> origin/mlazos/tf-mode-reland 2025-12-04T08:56:59.6675118Z * [new branch] mlazos/tf-mode-reland2 -> origin/mlazos/tf-mode-reland2 2025-12-04T08:56:59.6676709Z * [new branch] mlazos/tf-mode-reland3 -> origin/mlazos/tf-mode-reland3 2025-12-04T08:56:59.6678242Z * [new branch] mlazos/triton-no-epi -> origin/mlazos/triton-no-epi 2025-12-04T08:56:59.6679857Z * [new branch] mlazos/tune-proto -> origin/mlazos/tune-proto 2025-12-04T08:56:59.6681431Z * [new branch] mlazos/tuple-fixes -> origin/mlazos/tuple-fixes 2025-12-04T08:56:59.6683133Z * [new branch] mlazos/tuple-fixes2 -> origin/mlazos/tuple-fixes2 2025-12-04T08:56:59.6684837Z * [new branch] mlazos/tuple-handling -> origin/mlazos/tuple-handling 2025-12-04T08:56:59.6686517Z * [new branch] mlazos/user-stream-base -> origin/mlazos/user-stream-base 2025-12-04T08:56:59.6688515Z * [new branch] mlazos/user-streams -> origin/mlazos/user-streams 2025-12-04T08:56:59.6690133Z * [new branch] mlazos/user-streams-backup -> origin/mlazos/user-streams-backup 2025-12-04T08:56:59.6691666Z * [new branch] mlazos/user-streams-backup2 -> origin/mlazos/user-streams-backup2 2025-12-04T08:56:59.6693345Z * [new branch] mlazos/vary-beta -> origin/mlazos/vary-beta 2025-12-04T08:56:59.6695320Z * [new branch] mlazos/vary-beta2 -> origin/mlazos/vary-beta2 2025-12-04T08:56:59.6696912Z * [new branch] mlazos/weird-perf1 -> origin/mlazos/weird-perf1 2025-12-04T08:56:59.6698528Z * [new branch] mm_out_dtype_compile -> origin/mm_out_dtype_compile 2025-12-04T08:56:59.6700160Z * [new branch] module-shim -> origin/module-shim 2025-12-04T08:56:59.6701767Z * [new branch] move_config -> origin/move_config 2025-12-04T08:56:59.6703908Z * [new branch] msaroufim/reduce -> origin/msaroufim/reduce 2025-12-04T08:56:59.6705991Z * [new branch] mtia/basic-cmake -> origin/mtia/basic-cmake 2025-12-04T08:56:59.6708196Z * [new branch] mwizak/fix-triton-block-shape -> origin/mwizak/fix-triton-block-shape 2025-12-04T08:56:59.6709788Z * [new branch] my_varlen_backup -> origin/my_varlen_backup 2025-12-04T08:56:59.6711475Z * [new branch] nativert_num_outputs -> origin/nativert_num_outputs 2025-12-04T08:56:59.6713038Z * [new branch] new-codegen -> origin/new-codegen 2025-12-04T08:56:59.6714715Z * [new branch] newtest-base -> origin/newtest-base 2025-12-04T08:56:59.6716800Z * [new branch] ngimel/addmm_dtype -> origin/ngimel/addmm_dtype 2025-12-04T08:56:59.6718242Z * [new branch] ngimel/div_inv -> origin/ngimel/div_inv 2025-12-04T08:56:59.6719695Z * [new branch] ngimel/error_index_list -> origin/ngimel/error_index_list 2025-12-04T08:56:59.6721107Z * [new branch] ngimel/gather_grid -> origin/ngimel/gather_grid 2025-12-04T08:56:59.6722701Z * [new branch] ngimel/gather_grid_release -> origin/ngimel/gather_grid_release 2025-12-04T08:56:59.6724096Z * [new branch] ngimel/gg_new -> origin/ngimel/gg_new 2025-12-04T08:56:59.6725579Z * [new branch] ngimel/hostalloc -> origin/ngimel/hostalloc 2025-12-04T08:56:59.6727026Z * [new branch] ngimel/storage_id -> origin/ngimel/storage_id 2025-12-04T08:56:59.6728620Z * [new branch] nightly -> origin/nightly 2025-12-04T08:56:59.6730941Z * [new branch] nikitaved/addmm_1_rowcol_lt_path_check -> origin/nikitaved/addmm_1_rowcol_lt_path_check 2025-12-04T08:56:59.6732572Z * [new branch] nikitaved/addmm_epilogue_fusions_2d_bias -> origin/nikitaved/addmm_epilogue_fusions_2d_bias 2025-12-04T08:56:59.6734127Z * [new branch] nikitaved/addmm_epilogue_fusions_inductor -> origin/nikitaved/addmm_epilogue_fusions_inductor 2025-12-04T08:56:59.6735891Z * [new branch] nikitaved/addmm_epilogue_fusions_scratch -> origin/nikitaved/addmm_epilogue_fusions_scratch 2025-12-04T08:56:59.6737700Z * [new branch] nikitaved/grad_addmm_epilogue_fusions -> origin/nikitaved/grad_addmm_epilogue_fusions 2025-12-04T08:56:59.6739588Z * [new branch] nikitaved/simpler_can_use_32bit_index -> origin/nikitaved/simpler_can_use_32bit_index 2025-12-04T08:56:59.6741067Z * [new branch] nikitaved/test -> origin/nikitaved/test 2025-12-04T08:56:59.6743370Z * [new branch] nmacchioni-perf-test-async-autotune -> origin/nmacchioni-perf-test-async-autotune 2025-12-04T08:56:59.6744641Z * [new branch] no_distributed_log_spew -> origin/no_distributed_log_spew 2025-12-04T08:56:59.6746871Z * [new branch] nofun-hack -> origin/nofun-hack 2025-12-04T08:56:59.6748488Z * [new branch] norm_bench -> origin/norm_bench 2025-12-04T08:56:59.6751106Z * [new branch] nullplay/fuse_matmul -> origin/nullplay/fuse_matmul 2025-12-04T08:56:59.6752696Z * [new branch] nullplay_fuse_matmul -> origin/nullplay_fuse_matmul 2025-12-04T08:56:59.6754333Z * [new branch] optimizer_test -> origin/optimizer_test 2025-12-04T08:56:59.6757026Z * [new branch] orig/release/1.10 -> origin/orig/release/1.10 2025-12-04T08:56:59.6758682Z * [new branch] orig/release/1.11 -> origin/orig/release/1.11 2025-12-04T08:56:59.6760257Z * [new branch] orig/release/1.12 -> origin/orig/release/1.12 2025-12-04T08:56:59.6762331Z * [new branch] orig/release/1.13 -> origin/orig/release/1.13 2025-12-04T08:56:59.6763980Z * [new branch] orig/release/1.6 -> origin/orig/release/1.6 2025-12-04T08:56:59.6765719Z * [new branch] orig/release/1.7 -> origin/orig/release/1.7 2025-12-04T08:56:59.6767286Z * [new branch] orig/release/1.8 -> origin/orig/release/1.8 2025-12-04T08:56:59.6768901Z * [new branch] orig/release/1.9 -> origin/orig/release/1.9 2025-12-04T08:56:59.6770477Z * [new branch] orig/release/2.0 -> origin/orig/release/2.0 2025-12-04T08:56:59.6772073Z * [new branch] orig/release/2.1 -> origin/orig/release/2.1 2025-12-04T08:56:59.6773798Z * [new branch] orig/release/2.2 -> origin/orig/release/2.2 2025-12-04T08:56:59.6775308Z * [new branch] orig/release/2.3 -> origin/orig/release/2.3 2025-12-04T08:56:59.6776978Z * [new branch] orig/release/2.4 -> origin/orig/release/2.4 2025-12-04T08:56:59.6778413Z * [new branch] orig/release/2.5 -> origin/orig/release/2.5 2025-12-04T08:56:59.6780009Z * [new branch] orig/release/2.6 -> origin/orig/release/2.6 2025-12-04T08:56:59.6781970Z * [new branch] orig/release/2.7 -> origin/orig/release/2.7 2025-12-04T08:56:59.6784147Z * [new branch] orig/release/2.8 -> origin/orig/release/2.8 2025-12-04T08:56:59.6785680Z * [new branch] orig/release/2.9 -> origin/orig/release/2.9 2025-12-04T08:56:59.6789547Z * [new branch] origin/gh/fxdawnn/1/base -> origin/origin/gh/fxdawnn/1/base 2025-12-04T08:56:59.6791044Z * [new branch] origin/gh/fxdawnn/1/orig -> origin/origin/gh/fxdawnn/1/orig 2025-12-04T08:56:59.6794063Z * [new branch] origin/gh/zpcore/14/orig -> origin/origin/gh/zpcore/14/orig 2025-12-04T08:56:59.6795801Z * [new branch] oulgen-patch-1 -> origin/oulgen-patch-1 2025-12-04T08:56:59.6797581Z * [new branch] oulgen-patch-2 -> origin/oulgen-patch-2 2025-12-04T08:56:59.6799701Z * [new branch] oulgen-patch-3 -> origin/oulgen-patch-3 2025-12-04T08:56:59.6801467Z * [new branch] oulgen-patch-4 -> origin/oulgen-patch-4 2025-12-04T08:56:59.6803109Z * [new branch] padded-tensor -> origin/padded-tensor 2025-12-04T08:56:59.6804758Z * [new branch] pca2 -> origin/pca2 2025-12-04T08:56:59.6806515Z * [new branch] per_channel_backup -> origin/per_channel_backup 2025-12-04T08:56:59.6808176Z * [new branch] perf_ops -> origin/perf_ops 2025-12-04T08:56:59.6809815Z * [new branch] perf_ops_2_9 -> origin/perf_ops_2_9 2025-12-04T08:56:59.6811739Z * [new branch] pianpwk-patch-1 -> origin/pianpwk-patch-1 2025-12-04T08:56:59.6813895Z * [new branch] pianpwk/__draft_debug_mode -> origin/pianpwk/__draft_debug_mode 2025-12-04T08:56:59.6815349Z * [new branch] pianpwk/_debug_mode_for_triton_draft -> origin/pianpwk/_debug_mode_for_triton_draft 2025-12-04T08:56:59.6816787Z * [new branch] pianpwk/_debug_nn_module_compile -> origin/pianpwk/_debug_nn_module_compile 2025-12-04T08:56:59.6818652Z * [new branch] pianpwk/_draft_triton_11_3 -> origin/pianpwk/_draft_triton_11_3 2025-12-04T08:56:59.6820138Z * [new branch] pianpwk/_manual_bucket_draft -> origin/pianpwk/_manual_bucket_draft 2025-12-04T08:56:59.6821884Z * [new branch] pianpwk/_profile_w_dispatch_keys -> origin/pianpwk/_profile_w_dispatch_keys 2025-12-04T08:56:59.6823872Z * [new branch] pianpwk/_super_draft_debug_mode -> origin/pianpwk/_super_draft_debug_mode 2025-12-04T08:56:59.6825537Z * [new branch] pianpwk/_unbacked_local_shard_size -> origin/pianpwk/_unbacked_local_shard_size 2025-12-04T08:56:59.6827117Z * [new branch] pianpwk/anomaly_tb -> origin/pianpwk/anomaly_tb 2025-12-04T08:56:59.6828685Z * [new branch] pianpwk/auto_fx_annotate -> origin/pianpwk/auto_fx_annotate 2025-12-04T08:56:59.6830299Z * [new branch] pianpwk/backed_size_oblivious_export -> origin/pianpwk/backed_size_oblivious_export 2025-12-04T08:56:59.6831917Z * [new branch] pianpwk/bert_dynamic_perf -> origin/pianpwk/bert_dynamic_perf 2025-12-04T08:56:59.6833548Z * [new branch] pianpwk/debug_fwd_stack_traces -> origin/pianpwk/debug_fwd_stack_traces 2025-12-04T08:56:59.6835161Z * [new branch] pianpwk/debug_hash_tensor -> origin/pianpwk/debug_hash_tensor 2025-12-04T08:56:59.6836782Z * [new branch] pianpwk/debug_mode_annotate -> origin/pianpwk/debug_mode_annotate 2025-12-04T08:56:59.6838287Z * [new branch] pianpwk/debug_mode_defaults -> origin/pianpwk/debug_mode_defaults 2025-12-04T08:56:59.6839951Z * [new branch] pianpwk/debug_mode_hacks -> origin/pianpwk/debug_mode_hacks 2025-12-04T08:56:59.6841614Z * [new branch] pianpwk/debug_mode_opcall_refactor -> origin/pianpwk/debug_mode_opcall_refactor 2025-12-04T08:56:59.6843142Z * [new branch] pianpwk/debug_mode_show_ids -> origin/pianpwk/debug_mode_show_ids 2025-12-04T08:56:59.6844709Z * [new branch] pianpwk/debug_mode_triton -> origin/pianpwk/debug_mode_triton 2025-12-04T08:56:59.6846350Z * [new branch] pianpwk/debug_show_stack_trace -> origin/pianpwk/debug_show_stack_trace 2025-12-04T08:56:59.6847973Z * [new branch] pianpwk/debug_wait_on_collective -> origin/pianpwk/debug_wait_on_collective 2025-12-04T08:56:59.6850007Z * [new branch] pianpwk/debugmode_compile_tf -> origin/pianpwk/debugmode_compile_tf 2025-12-04T08:56:59.6851712Z * [new branch] pianpwk/dispatch_key_debugging_for_debug -> origin/pianpwk/dispatch_key_debugging_for_debug 2025-12-04T08:56:59.6853541Z * [new branch] pianpwk/draft_debug_mode_tfcompile -> origin/pianpwk/draft_debug_mode_tfcompile 2025-12-04T08:56:59.6855098Z * [new branch] pianpwk/draft_multikernel_nn -> origin/pianpwk/draft_multikernel_nn 2025-12-04T08:56:59.6856672Z * [new branch] pianpwk/draft_multikernel_status_10_5 -> origin/pianpwk/draft_multikernel_status_10_5 2025-12-04T08:56:59.6858260Z * [new branch] pianpwk/dtensor_custom_chunk -> origin/pianpwk/dtensor_custom_chunk 2025-12-04T08:56:59.6859964Z * [new branch] pianpwk/dtensor_unbacked_keypath -> origin/pianpwk/dtensor_unbacked_keypath 2025-12-04T08:56:59.6861602Z * [new branch] pianpwk/event_list_tree -> origin/pianpwk/event_list_tree 2025-12-04T08:56:59.6865018Z * [new branch] pianpwk/false_numel_refs -> origin/pianpwk/false_numel_refs 2025-12-04T08:56:59.6866467Z * [new branch] pianpwk/maybe_guard_rel -> origin/pianpwk/maybe_guard_rel 2025-12-04T08:56:59.6868054Z * [new branch] pianpwk/multikernel_hints_draft -> origin/pianpwk/multikernel_hints_draft 2025-12-04T08:56:59.6869738Z * [new branch] pianpwk/no_size_oblivious_slice_scat -> origin/pianpwk/no_size_oblivious_slice_scat 2025-12-04T08:56:59.6871317Z * [new branch] pianpwk/oblivious_reshape_view_better -> origin/pianpwk/oblivious_reshape_view_better 2025-12-04T08:56:59.6872814Z * [new branch] pianpwk/pre_forward_hook -> origin/pianpwk/pre_forward_hook 2025-12-04T08:56:59.6874419Z * [new branch] pianpwk/skip_python_keys_alternate -> origin/pianpwk/skip_python_keys_alternate 2025-12-04T08:56:59.6876041Z * [new branch] pianpwk/skip_python_keys_in_guards -> origin/pianpwk/skip_python_keys_in_guards 2025-12-04T08:56:59.6877578Z * [new branch] pianpwk/sym_tokens_draft -> origin/pianpwk/sym_tokens_draft 2025-12-04T08:56:59.6879122Z * [new branch] pianpwk/symint_one_hot -> origin/pianpwk/symint_one_hot 2025-12-04T08:56:59.6880890Z * [new branch] pianpwk/test_pointwise_guard_or_false -> origin/pianpwk/test_pointwise_guard_or_false 2025-12-04T08:56:59.6882445Z * [new branch] pianpwk/totally_draft_sym_wrap -> origin/pianpwk/totally_draft_sym_wrap 2025-12-04T08:56:59.6883943Z * [new branch] pianpwk/try_dumb_stuff -> origin/pianpwk/try_dumb_stuff 2025-12-04T08:56:59.6885499Z * [new branch] pianpwk/try_dumb_stuff_2 -> origin/pianpwk/try_dumb_stuff_2 2025-12-04T08:56:59.6887206Z * [new branch] pianpwk/unbacked_dtensor_mm -> origin/pianpwk/unbacked_dtensor_mm 2025-12-04T08:56:59.6888771Z * [new branch] pianpwk/unbacked_tracing_12_2 -> origin/pianpwk/unbacked_tracing_12_2 2025-12-04T08:56:59.6890318Z * [new branch] pianpwk/user_symints -> origin/pianpwk/user_symints 2025-12-04T08:56:59.6891855Z * [new branch] pianpwk/wan21_reshape -> origin/pianpwk/wan21_reshape 2025-12-04T08:56:59.6895161Z * [new branch] piz/fix_partial_backward_1112 -> origin/piz/fix_partial_backward_1112 2025-12-04T08:56:59.6896520Z * [new branch] piz/prop_cache_clean -> origin/piz/prop_cache_clean 2025-12-04T08:56:59.6898152Z * [new branch] pool-separate -> origin/pool-separate 2025-12-04T08:56:59.6899818Z * [new branch] pr-156087 -> origin/pr-156087 2025-12-04T08:56:59.6902029Z * [new branch] pr/131860 -> origin/pr/131860 2025-12-04T08:56:59.6903677Z * [new branch] predispatch_to -> origin/predispatch_to 2025-12-04T08:56:59.6905354Z * [new branch] protect-c17 -> origin/protect-c17 2025-12-04T08:56:59.6907050Z * [new branch] pt-opt-cuda3 -> origin/pt-opt-cuda3 2025-12-04T08:56:59.6909224Z * [new branch] python_compiled_autograd -> origin/python_compiled_autograd 2025-12-04T08:56:59.6911394Z * [new branch] q1l1/fix_device_moved_constant_type_unknown -> origin/q1l1/fix_device_moved_constant_type_unknown 2025-12-04T08:56:59.6913008Z * [new branch] q1l1/fix_wrong_default_type_for_kernel_call_args -> origin/q1l1/fix_wrong_default_type_for_kernel_call_args 2025-12-04T08:56:59.6915337Z * [new branch] qchip/export-D54134695 -> origin/qchip/export-D54134695 2025-12-04T08:56:59.6917073Z * [new branch] quote-pytest_cache -> origin/quote-pytest_cache 2025-12-04T08:56:59.6919019Z * [new branch] reland-accgrad-stream-warn -> origin/reland-accgrad-stream-warn 2025-12-04T08:56:59.6921208Z * [new branch] release/1.10 -> origin/release/1.10 2025-12-04T08:56:59.6922977Z * [new branch] release/1.11 -> origin/release/1.11 2025-12-04T08:56:59.6924491Z * [new branch] release/1.12 -> origin/release/1.12 2025-12-04T08:56:59.6926020Z * [new branch] release/1.13 -> origin/release/1.13 2025-12-04T08:56:59.6927496Z * [new branch] release/1.4 -> origin/release/1.4 2025-12-04T08:56:59.6928920Z * [new branch] release/1.4.1 -> origin/release/1.4.1 2025-12-04T08:56:59.6930449Z * [new branch] release/1.5 -> origin/release/1.5 2025-12-04T08:56:59.6932057Z * [new branch] release/1.6 -> origin/release/1.6 2025-12-04T08:56:59.6933746Z * [new branch] release/1.7 -> origin/release/1.7 2025-12-04T08:56:59.6935365Z * [new branch] release/1.8 -> origin/release/1.8 2025-12-04T08:56:59.6936914Z * [new branch] release/1.9 -> origin/release/1.9 2025-12-04T08:56:59.6938469Z * [new branch] release/2.0 -> origin/release/2.0 2025-12-04T08:56:59.6940123Z * [new branch] release/2.1 -> origin/release/2.1 2025-12-04T08:56:59.6941835Z * [new branch] release/2.2 -> origin/release/2.2 2025-12-04T08:56:59.6943746Z * [new branch] release/2.3 -> origin/release/2.3 2025-12-04T08:56:59.6945708Z * [new branch] release/2.4 -> origin/release/2.4 2025-12-04T08:56:59.6947863Z * [new branch] release/2.5 -> origin/release/2.5 2025-12-04T08:56:59.6949507Z * [new branch] release/2.6 -> origin/release/2.6 2025-12-04T08:56:59.6951167Z * [new branch] release/2.7 -> origin/release/2.7 2025-12-04T08:56:59.6952957Z * [new branch] release/2.8 -> origin/release/2.8 2025-12-04T08:56:59.6954604Z * [new branch] release/2.9 -> origin/release/2.9 2025-12-04T08:56:59.6956350Z * [new branch] release_notes -> origin/release_notes 2025-12-04T08:56:59.6958040Z * [new branch] remove_pyinterpreter -> origin/remove_pyinterpreter 2025-12-04T08:56:59.6959985Z * [new branch] replace-pytorch-labs-20250812-195836 -> origin/replace-pytorch-labs-20250812-195836 2025-12-04T08:56:59.6961517Z * [new branch] replace-pytorch-labs-20250812-200248 -> origin/replace-pytorch-labs-20250812-200248 2025-12-04T08:56:59.6963149Z * [new branch] replace-pytorch-labs-20250812-200324 -> origin/replace-pytorch-labs-20250812-200324 2025-12-04T08:56:59.6964649Z * [new branch] replace-pytorch-labs-20250812-204020 -> origin/replace-pytorch-labs-20250812-204020 2025-12-04T08:56:59.6967670Z * [new branch] revert-131069-gh/krzysztofjordan/1/head -> origin/revert-131069-gh/krzysztofjordan/1/head 2025-12-04T08:56:59.6970656Z * [new branch] revert-131469-gh/andrewor14/51/head -> origin/revert-131469-gh/andrewor14/51/head 2025-12-04T08:56:59.6973867Z * [new branch] revert-152361-gh/fadara01/1/head -> origin/revert-152361-gh/fadara01/1/head 2025-12-04T08:56:59.6976781Z * [new branch] revert-156870-gh/skarjala/3/head -> origin/revert-156870-gh/skarjala/3/head 2025-12-04T08:56:59.6978605Z * [new branch] revert-157914-cherry-pick-157503-by-pytorch_bot_bot_ -> origin/revert-157914-cherry-pick-157503-by-pytorch_bot_bot_ 2025-12-04T08:56:59.6980064Z * [new branch] revert-hoo-invoke-subgraph -> origin/revert-hoo-invoke-subgraph 2025-12-04T08:56:59.6981818Z * [new branch] revert_always_build_distributed -> origin/revert_always_build_distributed 2025-12-04T08:56:59.6983417Z * [new branch] rms_norm_patch -> origin/rms_norm_patch 2025-12-04T08:56:59.6985575Z * [new branch] ruisi/fix_all_to_all_estimation -> origin/ruisi/fix_all_to_all_estimation 2025-12-04T08:56:59.6987181Z * [new branch] ruisi/fix_comm_estimation -> origin/ruisi/fix_comm_estimation 2025-12-04T08:56:59.6988628Z * [new branch] ruisi/fix_dynamic_shape_estimation -> origin/ruisi/fix_dynamic_shape_estimation 2025-12-04T08:56:59.6990041Z * [new branch] ruisi/fix_llama3_autobucketing -> origin/ruisi/fix_llama3_autobucketing 2025-12-04T08:56:59.6991813Z * [new branch] ruisi/fix_manual_bucketing_ep_pass -> origin/ruisi/fix_manual_bucketing_ep_pass 2025-12-04T08:56:59.6993740Z * [new branch] ruisi/manual_bucket_pass -> origin/ruisi/manual_bucket_pass 2025-12-04T08:56:59.6996116Z * [new branch] ryanguo99/cleanup-dynamo-expected-failures -> origin/ryanguo99/cleanup-dynamo-expected-failures 2025-12-04T08:56:59.6997476Z * [new branch] ryanguo99/fix-closure-var -> origin/ryanguo99/fix-closure-var 2025-12-04T08:56:59.6999555Z * [new branch] rzou/faketensor_bench -> origin/rzou/faketensor_bench 2025-12-04T08:56:59.7001005Z * [new branch] rzou/njt -> origin/rzou/njt 2025-12-04T08:56:59.7002588Z * [new branch] rzou/pca -> origin/rzou/pca 2025-12-04T08:56:59.7004058Z * [new branch] rzou/realprop -> origin/rzou/realprop 2025-12-04T08:56:59.7005742Z * [new branch] samplevllm -> origin/samplevllm 2025-12-04T08:56:59.7008199Z * [new branch] sanchitintel/weird_thing_with_test_cpu_select_algorithm -> origin/sanchitintel/weird_thing_with_test_cpu_select_algorithm 2025-12-04T08:56:59.7009710Z * [new branch] sapling-pr-archive-SS-JIA -> origin/sapling-pr-archive-SS-JIA 2025-12-04T08:56:59.7011464Z * [new branch] sapling-pr-archive-tushar00jain -> origin/sapling-pr-archive-tushar00jain 2025-12-04T08:56:59.7013017Z * [new branch] save -> origin/save 2025-12-04T08:56:59.7014615Z * [new branch] scaled_mm -> origin/scaled_mm 2025-12-04T08:56:59.7016338Z * [new branch] scan_attempt -> origin/scan_attempt 2025-12-04T08:56:59.7018473Z * [new branch] sdym/2.5.1 -> origin/sdym/2.5.1 2025-12-04T08:56:59.7020171Z * [new branch] sekyondaMeta-dynamoconfig-fix -> origin/sekyondaMeta-dynamoconfig-fix 2025-12-04T08:56:59.7022187Z * [new branch] shengf/fx-xform-perf -> origin/shengf/fx-xform-perf 2025-12-04T08:56:59.7024858Z * [new branch] shoumikhin-patch-1 -> origin/shoumikhin-patch-1 2025-12-04T08:56:59.7025833Z * [new branch] solve-accuracy-fix -> origin/solve-accuracy-fix 2025-12-04T08:56:59.7027246Z * [new branch] some_rocm_inductor_skips -> origin/some_rocm_inductor_skips 2025-12-04T08:56:59.7029910Z * [new branch] soulitzer/stash-tls-ac -> origin/soulitzer/stash-tls-ac 2025-12-04T08:56:59.7031491Z * [new branch] sparse-mm-bf16-support -> origin/sparse-mm-bf16-support 2025-12-04T08:56:59.7033116Z * [new branch] starterTaskUpdate -> origin/starterTaskUpdate 2025-12-04T08:56:59.7034789Z * [new branch] suo -> origin/suo 2025-12-04T08:56:59.7036437Z * [new branch] sve-poc -> origin/sve-poc 2025-12-04T08:56:59.7038215Z * [new branch] switch-bn -> origin/switch-bn 2025-12-04T08:56:59.7039971Z * [new branch] sy_annotation_in_autograd_hop -> origin/sy_annotation_in_autograd_hop 2025-12-04T08:56:59.7041583Z * [new branch] sy_aot_eager_record -> origin/sy_aot_eager_record 2025-12-04T08:56:59.7043188Z * [new branch] sy_custom_bucketing -> origin/sy_custom_bucketing 2025-12-04T08:56:59.7044758Z * [new branch] sy_debug_mode_test -> origin/sy_debug_mode_test 2025-12-04T08:56:59.7046532Z * [new branch] sy_deserialize -> origin/sy_deserialize 2025-12-04T08:56:59.7048075Z * [new branch] sy_dump_gm_code -> origin/sy_dump_gm_code 2025-12-04T08:56:59.7049686Z * [new branch] sy_exp -> origin/sy_exp 2025-12-04T08:56:59.7051330Z * [new branch] sy_export_annotation -> origin/sy_export_annotation 2025-12-04T08:56:59.7053092Z * [new branch] sy_invoke_subgraph -> origin/sy_invoke_subgraph 2025-12-04T08:56:59.7054702Z * [new branch] sy_kernel_bw_name -> origin/sy_kernel_bw_name 2025-12-04T08:56:59.7056374Z * [new branch] sy_multi_arch -> origin/sy_multi_arch 2025-12-04T08:56:59.7057942Z * [new branch] sy_nn_module_stack -> origin/sy_nn_module_stack 2025-12-04T08:56:59.7059847Z * [new branch] sy_original_dtensor -> origin/sy_original_dtensor 2025-12-04T08:56:59.7061515Z * [new branch] sy_profiler_cia -> origin/sy_profiler_cia 2025-12-04T08:56:59.7063484Z * [new branch] symm_mem_sync -> origin/symm_mem_sync 2025-12-04T08:56:59.7065143Z * [new branch] sympy-bottleneck-repro -> origin/sympy-bottleneck-repro 2025-12-04T08:56:59.7066890Z * [new branch] tensordict_integration -> origin/tensordict_integration 2025-12-04T08:56:59.7068653Z * [new branch] test-move-conda-builds -> origin/test-move-conda-builds 2025-12-04T08:56:59.7070242Z * [new branch] test-old -> origin/test-old 2025-12-04T08:56:59.7072345Z * [new branch] test/bmm_heur -> origin/test/bmm_heur 2025-12-04T08:56:59.7074474Z * [new branch] tianren/customOp_autotune_fix -> origin/tianren/customOp_autotune_fix 2025-12-04T08:56:59.7076012Z * [new branch] tianren/customOp_enable_max_autotune -> origin/tianren/customOp_enable_max_autotune 2025-12-04T08:56:59.7077899Z * [new branch] tianren/customOp_fusion -> origin/tianren/customOp_fusion 2025-12-04T08:56:59.7079503Z * [new branch] tianren/customop_collectiveop_benchmark -> origin/tianren/customop_collectiveop_benchmark 2025-12-04T08:56:59.7081462Z * [new branch] tianren/customop_collectiveop_benchmark_fix -> origin/tianren/customop_collectiveop_benchmark_fix 2025-12-04T08:56:59.7083417Z * [new branch] tianren/customop_dynamic_config -> origin/tianren/customop_dynamic_config 2025-12-04T08:56:59.7085028Z * [new branch] tianren/dynamic_range_input -> origin/tianren/dynamic_range_input 2025-12-04T08:56:59.7086664Z * [new branch] tianren/dynamic_range_input_fix -> origin/tianren/dynamic_range_input_fix 2025-12-04T08:56:59.7088232Z * [new branch] tianren/dynamic_range_input_merge -> origin/tianren/dynamic_range_input_merge 2025-12-04T08:56:59.7089864Z * [new branch] tianren/flex_paged_attn_fix_temp -> origin/tianren/flex_paged_attn_fix_temp 2025-12-04T08:56:59.7091530Z * [new branch] tianren/fx_codegen_dump -> origin/tianren/fx_codegen_dump 2025-12-04T08:56:59.7093268Z * [new branch] tianren/symmetric_memory -> origin/tianren/symmetric_memory 2025-12-04T08:56:59.7094788Z * [new branch] tianren/test -> origin/tianren/test 2025-12-04T08:56:59.7096447Z * [new branch] tidy_performance_cyy -> origin/tidy_performance_cyy 2025-12-04T08:56:59.7098012Z * [new branch] tmp -> origin/tmp 2025-12-04T08:56:59.7099688Z * [new branch] torchtitan_ep -> origin/torchtitan_ep 2025-12-04T08:56:59.7101356Z * [new branch] torchtitan_integration -> origin/torchtitan_integration 2025-12-04T08:56:59.7103049Z * [new branch] trace_fsdp_torchtune_lora -> origin/trace_fsdp_torchtune_lora 2025-12-04T08:56:59.7104822Z * [new branch] traceable_fsdp_unit_tests -> origin/traceable_fsdp_unit_tests 2025-12-04T08:56:59.7106342Z * [new branch] tree_loop_vec_base -> origin/tree_loop_vec_base 2025-12-04T08:56:59.7107962Z * [new branch] triton_kernel -> origin/triton_kernel 2025-12-04T08:56:59.7109705Z * [new branch] tt_pkg_1908 -> origin/tt_pkg_1908 2025-12-04T08:56:59.7111337Z * [new branch] type_dec -> origin/type_dec 2025-12-04T08:56:59.7113007Z * [new branch] udate-sphinx-dependancies -> origin/udate-sphinx-dependancies 2025-12-04T08:56:59.7115269Z * [new branch] update-audio-commit-hash/17630256502-1803-1 -> origin/update-audio-commit-hash/17630256502-1803-1 2025-12-04T08:56:59.7116883Z * [new branch] update-audio-commit-hash/19087141161-1916-1 -> origin/update-audio-commit-hash/19087141161-1916-1 2025-12-04T08:56:59.7118361Z * [new branch] update-audio-commit-hash/19250643381-1929-1 -> origin/update-audio-commit-hash/19250643381-1929-1 2025-12-04T08:56:59.7119822Z * [new branch] update-audio-commit-hash/19397724337-1935-1 -> origin/update-audio-commit-hash/19397724337-1935-1 2025-12-04T08:56:59.7121290Z * [new branch] update-audio-commit-hash/19555670148-1941-1 -> origin/update-audio-commit-hash/19555670148-1941-1 2025-12-04T08:56:59.7123321Z * [new branch] update-audio-commit-hash/19750627930-1946-1 -> origin/update-audio-commit-hash/19750627930-1946-1 2025-12-04T08:56:59.7125475Z * [new branch] update-triton-commit-hash/13663274526-1487-2 -> origin/update-triton-commit-hash/13663274526-1487-2 2025-12-04T08:56:59.7127583Z * [new branch] update-vision-commit-hash/19087141161-1916-1 -> origin/update-vision-commit-hash/19087141161-1916-1 2025-12-04T08:56:59.7129100Z * [new branch] update-vision-commit-hash/19184897099-1925-1 -> origin/update-vision-commit-hash/19184897099-1925-1 2025-12-04T08:56:59.7130517Z * [new branch] update-vision-commit-hash/19250643381-1929-1 -> origin/update-vision-commit-hash/19250643381-1929-1 2025-12-04T08:56:59.7132025Z * [new branch] update-vision-commit-hash/19381328640-1934-1 -> origin/update-vision-commit-hash/19381328640-1934-1 2025-12-04T08:56:59.7133717Z * [new branch] update-vision-commit-hash/19485237164-1938-1 -> origin/update-vision-commit-hash/19485237164-1938-1 2025-12-04T08:56:59.7135857Z * [new branch] update-vllm-commit-hash/18451675449-1879-1 -> origin/update-vllm-commit-hash/18451675449-1879-1 2025-12-04T08:56:59.7137474Z * [new branch] update-vllm-dockerfile -> origin/update-vllm-dockerfile 2025-12-04T08:56:59.7139837Z * [new branch] update-xla-commit-hash/19224287370-211-1 -> origin/update-xla-commit-hash/19224287370-211-1 2025-12-04T08:56:59.7141316Z * [new branch] update-xla-commit-hash/19422028566-212-1 -> origin/update-xla-commit-hash/19422028566-212-1 2025-12-04T08:56:59.7142816Z * [new branch] update-xla-commit-hash/19626841311-213-1 -> origin/update-xla-commit-hash/19626841311-213-1 2025-12-04T08:56:59.7144505Z * [new branch] update_docs_torch_multinomial_issue#125388 -> origin/update_docs_torch_multinomial_issue#125388 2025-12-04T08:56:59.7146045Z * [new branch] update_operator_readme -> origin/update_operator_readme 2025-12-04T08:56:59.7147735Z * [new branch] update_slow_tests_1722488736 -> origin/update_slow_tests_1722488736 2025-12-04T08:56:59.7149411Z * [new branch] update_slow_tests_1722879173 -> origin/update_slow_tests_1722879173 2025-12-04T08:56:59.7151043Z * [new branch] update_slow_tests_1762155677 -> origin/update_slow_tests_1762155677 2025-12-04T08:56:59.7152730Z * [new branch] update_slow_tests_1763365283 -> origin/update_slow_tests_1763365283 2025-12-04T08:56:59.7154479Z * [new branch] update_submodule_FBGEMM -> origin/update_submodule_FBGEMM 2025-12-04T08:56:59.7156061Z * [new branch] update_submodule_kineto -> origin/update_submodule_kineto 2025-12-04T08:56:59.7157642Z * [new branch] update_submodule_tensorpipe -> origin/update_submodule_tensorpipe 2025-12-04T08:56:59.7159298Z * [new branch] upload-tests-for-autorevert -> origin/upload-tests-for-autorevert 2025-12-04T08:56:59.7160974Z * [new branch] v0.1.2 -> origin/v0.1.2 2025-12-04T08:56:59.7162980Z * [new branch] v1.0.1 -> origin/v1.0.1 2025-12-04T08:56:59.7164742Z * [new branch] v1.0.3 -> origin/v1.0.3 2025-12-04T08:56:59.7166443Z * [new branch] v1.1.0 -> origin/v1.1.0 2025-12-04T08:56:59.7168441Z * [new branch] v1.2.0 -> origin/v1.2.0 2025-12-04T08:56:59.7170178Z * [new branch] v1.3.0 -> origin/v1.3.0 2025-12-04T08:56:59.7171848Z * [new branch] v1.3.1 -> origin/v1.3.1 2025-12-04T08:56:59.7173721Z * [new branch] validate_fn -> origin/validate_fn 2025-12-04T08:56:59.7175506Z * [new branch] validations_2.6 -> origin/validations_2.6 2025-12-04T08:56:59.7177237Z * [new branch] validations_2.8 -> origin/validations_2.8 2025-12-04T08:56:59.7178824Z * [new branch] varlen-api -> origin/varlen-api 2025-12-04T08:56:59.7180472Z * [new branch] varlen-api-backup -> origin/varlen-api-backup 2025-12-04T08:56:59.7182079Z * [new branch] varlen_batch_invariance -> origin/varlen_batch_invariance 2025-12-04T08:56:59.7184034Z * [new branch] viable/strict -> origin/viable/strict 2025-12-04T08:56:59.7186366Z * [new branch] vishal9-team/dtensor_parallelism_toy -> origin/vishal9-team/dtensor_parallelism_toy 2025-12-04T08:56:59.7187901Z * [new branch] vllmbuildci -> origin/vllmbuildci 2025-12-04T08:56:59.7189726Z * [new branch] vllmpin -> origin/vllmpin 2025-12-04T08:56:59.7191506Z * [new branch] vscode-recommend-pyrefly -> origin/vscode-recommend-pyrefly 2025-12-04T08:56:59.7193693Z * [new branch] wdvr-patch-1 -> origin/wdvr-patch-1 2025-12-04T08:56:59.7195807Z * [new branch] wdvr/iss_145259 -> origin/wdvr/iss_145259 2025-12-04T08:56:59.7197924Z * [new branch] whc/pei -> origin/whc/pei 2025-12-04T08:56:59.7199470Z * [new branch] whc/pp_fix -> origin/whc/pp_fix 2025-12-04T08:56:59.7201022Z * [new branch] whc/sharding -> origin/whc/sharding 2025-12-04T08:56:59.7202513Z * [new branch] whc/sharding2 -> origin/whc/sharding2 2025-12-04T08:56:59.7204013Z * [new branch] whc/uneven -> origin/whc/uneven 2025-12-04T08:56:59.7205784Z * [new branch] whc/uneven-merge -> origin/whc/uneven-merge 2025-12-04T08:56:59.7207387Z * [new branch] win_warnings -> origin/win_warnings 2025-12-04T08:56:59.7209077Z * [new branch] windows_libtorch_free -> origin/windows_libtorch_free 2025-12-04T08:56:59.7210670Z * [new branch] xmfan-war -> origin/xmfan-war 2025-12-04T08:56:59.7212864Z * [new branch] xmfan/ca_0516 -> origin/xmfan/ca_0516 2025-12-04T08:56:59.7214304Z * [new branch] xmfan/ca_1051b93192 -> origin/xmfan/ca_1051b93192 2025-12-04T08:56:59.7215862Z * [new branch] xmfan/ca_1a722f62c248391fc4a542e8851a5559aa356ae8 -> origin/xmfan/ca_1a722f62c248391fc4a542e8851a5559aa356ae8 2025-12-04T08:56:59.7217164Z * [new branch] xmfan/ca_5a2be192d1 -> origin/xmfan/ca_5a2be192d1 2025-12-04T08:56:59.7218766Z * [new branch] xmfan/ca_9d59b516e9 -> origin/xmfan/ca_9d59b516e9 2025-12-04T08:56:59.7220167Z * [new branch] xmfan/ca_apr8 -> origin/xmfan/ca_apr8 2025-12-04T08:56:59.7221563Z * [new branch] xmfan/ca_base -> origin/xmfan/ca_base 2025-12-04T08:56:59.7223326Z * [new branch] xmfan/ca_dynamic -> origin/xmfan/ca_dynamic 2025-12-04T08:56:59.7225224Z * [new branch] xmfan/ca_fix_dyn -> origin/xmfan/ca_fix_dyn 2025-12-04T08:56:59.7226768Z * [new branch] xmfan/ca_fix_lowering -> origin/xmfan/ca_fix_lowering 2025-12-04T08:56:59.7228332Z * [new branch] xmfan/ca_fix_polyfills -> origin/xmfan/ca_fix_polyfills 2025-12-04T08:56:59.7229749Z * [new branch] xmfan/ca_jan3 -> origin/xmfan/ca_jan3 2025-12-04T08:56:59.7231314Z * [new branch] xmfan/ca_jun18 -> origin/xmfan/ca_jun18 2025-12-04T08:56:59.7232868Z * [new branch] xmfan/ca_jun24 -> origin/xmfan/ca_jun24 2025-12-04T08:56:59.7234445Z * [new branch] xmfan/ca_nested -> origin/xmfan/ca_nested 2025-12-04T08:56:59.7236098Z * [new branch] xmfan/ca_overhead -> origin/xmfan/ca_overhead 2025-12-04T08:56:59.7237803Z * [new branch] xmfan/ca_overhead_0eba7e5451 -> origin/xmfan/ca_overhead_0eba7e5451 2025-12-04T08:56:59.7239308Z * [new branch] xmfan/cacu_jun18 -> origin/xmfan/cacu_jun18 2025-12-04T08:56:59.7241310Z * [new branch] xmfan/cacu_jun19 -> origin/xmfan/cacu_jun19 2025-12-04T08:56:59.7242917Z * [new branch] xmfan/cacu_jun4 -> origin/xmfan/cacu_jun4 2025-12-04T08:56:59.7244509Z * [new branch] xmfan/disable_duck_shape -> origin/xmfan/disable_duck_shape 2025-12-04T08:56:59.7246104Z * [new branch] xmfan/fca_cpp_node_passthrough -> origin/xmfan/fca_cpp_node_passthrough 2025-12-04T08:56:59.7247799Z * [new branch] xmfan/post_3945954741e2d37023c5d6954f9483008e0892f9 -> origin/xmfan/post_3945954741e2d37023c5d6954f9483008e0892f9 2025-12-04T08:56:59.7249395Z * [new branch] xmfan/pre_3945954741e2d37023c5d6954f9483008e0892f9 -> origin/xmfan/pre_3945954741e2d37023c5d6954f9483008e0892f9 2025-12-04T08:56:59.7250746Z * [new branch] xmfan/single_step -> origin/xmfan/single_step 2025-12-04T08:56:59.7252375Z * [new branch] xmfan/sth_0829 -> origin/xmfan/sth_0829 2025-12-04T08:56:59.7254180Z * [new branch] xmfan/test -> origin/xmfan/test 2025-12-04T08:56:59.7256360Z * [new branch] yguo/debug-0226-constexpr -> origin/yguo/debug-0226-constexpr 2025-12-04T08:56:59.7257725Z * [new branch] yguo/new_latest_changes -> origin/yguo/new_latest_changes 2025-12-04T08:56:59.7259197Z * [new branch] yguo/patch_constexpr_changes -> origin/yguo/patch_constexpr_changes 2025-12-04T08:56:59.7261303Z * [new branch] yiming/bootcamp -> origin/yiming/bootcamp 2025-12-04T08:56:59.7264506Z * [new branch] yiming/run_with_start_end_rng_hop -> origin/yiming/run_with_start_end_rng_hop 2025-12-04T08:56:59.7266072Z * [new branch] yolo-llama3 -> origin/yolo-llama3 2025-12-04T08:56:59.7268201Z * [new branch] zainr/canary-test -> origin/zainr/canary-test 2025-12-04T08:56:59.7269817Z * [new branch] zainr/cleanup-gh-runners -> origin/zainr/cleanup-gh-runners 2025-12-04T08:56:59.7271251Z * [new branch] zainr/pull-migration-c -> origin/zainr/pull-migration-c 2025-12-04T08:56:59.7272649Z * [new branch] zainr/test2 -> origin/zainr/test2 2025-12-04T08:56:59.7274366Z * [new branch] zasdfgbnm-patch-3 -> origin/zasdfgbnm-patch-3 2025-12-04T08:56:59.7276113Z * [new branch] zb2p -> origin/zb2p 2025-12-04T08:56:59.7277719Z * [new branch] zeros-and-scatter-part2 -> origin/zeros-and-scatter-part2 2025-12-04T08:56:59.7280189Z * [new branch] zhxchen17/ci/vllm_lora_oom -> origin/zhxchen17/ci/vllm_lora_oom 2025-12-04T08:56:59.7281753Z * [new branch] zhxchen17/ci/vllm_multimodal_oom -> origin/zhxchen17/ci/vllm_multimodal_oom 2025-12-04T08:56:59.7283253Z * [new branch] zhxchen17/ci/vllm_pin -> origin/zhxchen17/ci/vllm_pin 2025-12-04T08:56:59.7285482Z * [new branch] zhxchen17/dynamo/unsafe_drop_all_guards -> origin/zhxchen17/dynamo/unsafe_drop_all_guards 2025-12-04T08:56:59.7287882Z * [new branch] zhxchen17/export/call_override -> origin/zhxchen17/export/call_override 2025-12-04T08:56:59.7289445Z * [new branch] zhxchen17/export/codemod1 -> origin/zhxchen17/export/codemod1 2025-12-04T08:56:59.7291005Z * [new branch] zhxchen17/export/ctx_return -> origin/zhxchen17/export/ctx_return 2025-12-04T08:56:59.7292712Z * [new branch] zhxchen17/export/disable_side_effect_warn -> origin/zhxchen17/export/disable_side_effect_warn 2025-12-04T08:56:59.7294381Z * [new branch] zhxchen17/export/pytree_check -> origin/zhxchen17/export/pytree_check 2025-12-04T08:56:59.7296331Z * [new branch] zhxchen17/precompile/aoti -> origin/zhxchen17/precompile/aoti 2025-12-04T08:56:59.7297948Z * [new branch] zhxchen17/precompile/globals -> origin/zhxchen17/precompile/globals 2025-12-04T08:56:59.7299477Z * [new branch] zhxchen17/precompile/inductor_guards -> origin/zhxchen17/precompile/inductor_guards 2025-12-04T08:56:59.7301464Z * [new branch] zhxchen17/scratch/0 -> origin/zhxchen17/scratch/0 2025-12-04T08:56:59.7303045Z * [new branch] zhxchen17/torch_export_api_update -> origin/zhxchen17/torch_export_api_update 2025-12-04T08:56:59.7305216Z * [new branch] zhxhcen17/moodycamel -> origin/zhxhcen17/moodycamel 2025-12-04T08:56:59.7307368Z * [new branch] zxiiro/build-times -> origin/zxiiro/build-times 2025-12-04T08:56:59.7308863Z * [new branch] zxiiro/c7i.2xlarge -> origin/zxiiro/c7i.2xlarge 2025-12-04T08:56:59.7310414Z * [new branch] zxiiro/c7i.2xlarge.h100 -> origin/zxiiro/c7i.2xlarge.h100 2025-12-04T08:56:59.7311947Z * [new branch] zxiiro/main -> origin/zxiiro/main 2025-12-04T08:56:59.7313474Z * [new branch] zxiiro/risc64 -> origin/zxiiro/risc64 2025-12-04T08:56:59.7315050Z * [new branch] zxiiro/test-multicloud-arc -> origin/zxiiro/test-multicloud-arc 2025-12-04T08:56:59.7316486Z * [new tag] bc2caa7fdf006894eff7af936babde69ab5a40f8-huydhn-debug -> bc2caa7fdf006894eff7af936babde69ab5a40f8-huydhn-debug 2025-12-04T08:56:59.7317803Z * [new tag] ci/binaries/77164 -> ci/binaries/77164 2025-12-04T08:56:59.7319180Z * [new tag] ciflow/b200/115316 -> ciflow/b200/115316 2025-12-04T08:56:59.7320403Z * [new tag] ciflow/b200/160685 -> ciflow/b200/160685 2025-12-04T08:56:59.7321342Z * [new tag] ciflow/b200/161607 -> ciflow/b200/161607 2025-12-04T08:56:59.7322509Z * [new tag] ciflow/b200/161938 -> ciflow/b200/161938 2025-12-04T08:56:59.7323713Z * [new tag] ciflow/b200/167207 -> ciflow/b200/167207 2025-12-04T08:56:59.7324580Z * [new tag] ciflow/b200/167989 -> ciflow/b200/167989 2025-12-04T08:56:59.7325872Z * [new tag] ciflow/b200/168096 -> ciflow/b200/168096 2025-12-04T08:56:59.7327017Z * [new tag] ciflow/b200/168175 -> ciflow/b200/168175 2025-12-04T08:56:59.7327921Z * [new tag] ciflow/b200/168195 -> ciflow/b200/168195 2025-12-04T08:56:59.7329349Z * [new tag] ciflow/b200/169200 -> ciflow/b200/169200 2025-12-04T08:56:59.7330343Z * [new tag] ciflow/b200/169216 -> ciflow/b200/169216 2025-12-04T08:56:59.7332053Z * [new tag] ciflow/b200/169380 -> ciflow/b200/169380 2025-12-04T08:56:59.7333798Z * [new tag] ciflow/b200/169412 -> ciflow/b200/169412 2025-12-04T08:56:59.7335318Z * [new tag] ciflow/b200/169470 -> ciflow/b200/169470 2025-12-04T08:56:59.7336293Z * [new tag] ciflow/b200/169471 -> ciflow/b200/169471 2025-12-04T08:56:59.7337586Z * [new tag] ciflow/b200/169472 -> ciflow/b200/169472 2025-12-04T08:56:59.7338876Z * [new tag] ciflow/b200/169514 -> ciflow/b200/169514 2025-12-04T08:56:59.7339865Z * [new tag] ciflow/b200/169517 -> ciflow/b200/169517 2025-12-04T08:56:59.7341422Z * [new tag] ciflow/binaries/165922 -> ciflow/binaries/165922 2025-12-04T08:56:59.7342357Z * [new tag] ciflow/binaries/169510 -> ciflow/binaries/169510 2025-12-04T08:56:59.7343818Z * [new tag] ciflow/binaries_wheel/157994 -> ciflow/binaries_wheel/157994 2025-12-04T08:56:59.7344791Z * [new tag] ciflow/binaries_wheel/166829 -> ciflow/binaries_wheel/166829 2025-12-04T08:56:59.7345988Z * [new tag] ciflow/binaries_wheel/167972 -> ciflow/binaries_wheel/167972 2025-12-04T08:56:59.7347065Z * [new tag] ciflow/binaries_wheel/167981 -> ciflow/binaries_wheel/167981 2025-12-04T08:56:59.7348361Z * [new tag] ciflow/dynamo/167695 -> ciflow/dynamo/167695 2025-12-04T08:56:59.7349353Z * [new tag] ciflow/dynamo/168096 -> ciflow/dynamo/168096 2025-12-04T08:56:59.7350650Z * [new tag] ciflow/dynamo/169525 -> ciflow/dynamo/169525 2025-12-04T08:56:59.7351983Z * [new tag] ciflow/h100-cutlass-backend/161938 -> ciflow/h100-cutlass-backend/161938 2025-12-04T08:56:59.7352870Z * [new tag] ciflow/h100-cutlass-backend/161940 -> ciflow/h100-cutlass-backend/161940 2025-12-04T08:56:59.7354360Z * [new tag] ciflow/h100-distributed/168923 -> ciflow/h100-distributed/168923 2025-12-04T08:56:59.7355397Z * [new tag] ciflow/h100-symm-mem/167552 -> ciflow/h100-symm-mem/167552 2025-12-04T08:56:59.7356402Z * [new tag] ciflow/h100-symm-mem/168129 -> ciflow/h100-symm-mem/168129 2025-12-04T08:56:59.7357395Z * [new tag] ciflow/h100-symm-mem/168917 -> ciflow/h100-symm-mem/168917 2025-12-04T08:56:59.7358796Z * [new tag] ciflow/h100-symm-mem/169156 -> ciflow/h100-symm-mem/169156 2025-12-04T08:56:59.7360320Z * [new tag] ciflow/h100-symm-mem/169200 -> ciflow/h100-symm-mem/169200 2025-12-04T08:56:59.7361219Z * [new tag] ciflow/h100-symm-mem/169216 -> ciflow/h100-symm-mem/169216 2025-12-04T08:56:59.7362360Z * [new tag] ciflow/h100-symm-mem/169338 -> ciflow/h100-symm-mem/169338 2025-12-04T08:56:59.7363927Z * [new tag] ciflow/h100-symm-mem/169355 -> ciflow/h100-symm-mem/169355 2025-12-04T08:56:59.7364899Z * [new tag] ciflow/h100-symm-mem/169543 -> ciflow/h100-symm-mem/169543 2025-12-04T08:56:59.7366241Z * [new tag] ciflow/h100/115316 -> ciflow/h100/115316 2025-12-04T08:56:59.7367117Z * [new tag] ciflow/h100/160685 -> ciflow/h100/160685 2025-12-04T08:56:59.7368301Z * [new tag] ciflow/h100/160729 -> ciflow/h100/160729 2025-12-04T08:56:59.7369214Z * [new tag] ciflow/h100/161607 -> ciflow/h100/161607 2025-12-04T08:56:59.7370182Z * [new tag] ciflow/h100/161938 -> ciflow/h100/161938 2025-12-04T08:56:59.7371309Z * [new tag] ciflow/h100/167207 -> ciflow/h100/167207 2025-12-04T08:56:59.7372477Z * [new tag] ciflow/h100/167989 -> ciflow/h100/167989 2025-12-04T08:56:59.7373460Z * [new tag] ciflow/h100/168096 -> ciflow/h100/168096 2025-12-04T08:56:59.7374421Z * [new tag] ciflow/h100/168175 -> ciflow/h100/168175 2025-12-04T08:56:59.7375409Z * [new tag] ciflow/h100/168195 -> ciflow/h100/168195 2025-12-04T08:56:59.7376623Z * [new tag] ciflow/h100/168980 -> ciflow/h100/168980 2025-12-04T08:56:59.7377952Z * [new tag] ciflow/h100/169200 -> ciflow/h100/169200 2025-12-04T08:56:59.7379382Z * [new tag] ciflow/h100/169216 -> ciflow/h100/169216 2025-12-04T08:56:59.7380670Z * [new tag] ciflow/h100/169380 -> ciflow/h100/169380 2025-12-04T08:56:59.7381643Z * [new tag] ciflow/h100/169412 -> ciflow/h100/169412 2025-12-04T08:56:59.7382945Z * [new tag] ciflow/h100/169470 -> ciflow/h100/169470 2025-12-04T08:56:59.7383862Z * [new tag] ciflow/h100/169471 -> ciflow/h100/169471 2025-12-04T08:56:59.7385009Z * [new tag] ciflow/h100/169472 -> ciflow/h100/169472 2025-12-04T08:56:59.7386084Z * [new tag] ciflow/h100/169514 -> ciflow/h100/169514 2025-12-04T08:56:59.7387493Z * [new tag] ciflow/inductor-cu126/168096 -> ciflow/inductor-cu126/168096 2025-12-04T08:56:59.7389040Z * [new tag] ciflow/inductor-micro-benchmark-cpu-x86/168096 -> ciflow/inductor-micro-benchmark-cpu-x86/168096 2025-12-04T08:56:59.7390124Z * [new tag] ciflow/inductor-micro-benchmark/166165 -> ciflow/inductor-micro-benchmark/166165 2025-12-04T08:56:59.7391137Z * [new tag] ciflow/inductor-micro-benchmark/168096 -> ciflow/inductor-micro-benchmark/168096 2025-12-04T08:56:59.7393083Z * [new tag] ciflow/inductor-perf-compare/168096 -> ciflow/inductor-perf-compare/168096 2025-12-04T08:56:59.7394744Z * [new tag] ciflow/inductor-perf-test-nightly-rocm-mi300/168073 -> ciflow/inductor-perf-test-nightly-rocm-mi300/168073 2025-12-04T08:56:59.7395673Z * [new tag] ciflow/inductor-perf-test-nightly-rocm-mi300/168096 -> ciflow/inductor-perf-test-nightly-rocm-mi300/168096 2025-12-04T08:56:59.7396804Z * [new tag] ciflow/inductor-perf-test-nightly-rocm-mi300/169024 -> ciflow/inductor-perf-test-nightly-rocm-mi300/169024 2025-12-04T08:56:59.7398252Z * [new tag] ciflow/inductor-perf-test-nightly-rocm-mi355/169024 -> ciflow/inductor-perf-test-nightly-rocm-mi355/169024 2025-12-04T08:56:59.7399290Z * [new tag] ciflow/inductor-perf-test-nightly/168096 -> ciflow/inductor-perf-test-nightly/168096 2025-12-04T08:56:59.7400635Z * [new tag] ciflow/inductor-periodic/168096 -> ciflow/inductor-periodic/168096 2025-12-04T08:56:59.7401545Z * [new tag] ciflow/inductor-periodic/169024 -> ciflow/inductor-periodic/169024 2025-12-04T08:56:59.7402649Z * [new tag] ciflow/inductor-periodic/169425 -> ciflow/inductor-periodic/169425 2025-12-04T08:56:59.7404116Z * [new tag] ciflow/inductor-rocm-mi200/165545 -> ciflow/inductor-rocm-mi200/165545 2025-12-04T08:56:59.7405139Z * [new tag] ciflow/inductor-rocm-mi200/165997 -> ciflow/inductor-rocm-mi200/165997 2025-12-04T08:56:59.7406121Z * [new tag] ciflow/inductor-rocm-mi200/168096 -> ciflow/inductor-rocm-mi200/168096 2025-12-04T08:56:59.7407324Z * [new tag] ciflow/inductor-rocm-mi200/169063 -> ciflow/inductor-rocm-mi200/169063 2025-12-04T08:56:59.7408258Z * [new tag] ciflow/inductor-rocm-mi200/169425 -> ciflow/inductor-rocm-mi200/169425 2025-12-04T08:56:59.7409685Z * [new tag] ciflow/inductor-rocm-mi300/165545 -> ciflow/inductor-rocm-mi300/165545 2025-12-04T08:56:59.7410710Z * [new tag] ciflow/inductor-rocm-mi300/168096 -> ciflow/inductor-rocm-mi300/168096 2025-12-04T08:56:59.7411620Z * [new tag] ciflow/inductor-rocm-mi300/169063 -> ciflow/inductor-rocm-mi300/169063 2025-12-04T08:56:59.7412741Z * [new tag] ciflow/inductor-rocm-mi300/169425 -> ciflow/inductor-rocm-mi300/169425 2025-12-04T08:56:59.7414210Z * [new tag] ciflow/inductor-rocm/162052 -> ciflow/inductor-rocm/162052 2025-12-04T08:56:59.7415258Z * [new tag] ciflow/inductor-rocm/168971 -> ciflow/inductor-rocm/168971 2025-12-04T08:56:59.7416637Z * [new tag] ciflow/inductor-windows/168096 -> ciflow/inductor-windows/168096 2025-12-04T08:56:59.7417901Z * [new tag] ciflow/inductor/144542 -> ciflow/inductor/144542 2025-12-04T08:56:59.7418779Z * [new tag] ciflow/inductor/146506 -> ciflow/inductor/146506 2025-12-04T08:56:59.7419945Z * [new tag] ciflow/inductor/147990 -> ciflow/inductor/147990 2025-12-04T08:56:59.7421110Z * [new tag] ciflow/inductor/148294 -> ciflow/inductor/148294 2025-12-04T08:56:59.7422058Z * [new tag] ciflow/inductor/148492 -> ciflow/inductor/148492 2025-12-04T08:56:59.7423100Z * [new tag] ciflow/inductor/157149 -> ciflow/inductor/157149 2025-12-04T08:56:59.7424371Z * [new tag] ciflow/inductor/157994 -> ciflow/inductor/157994 2025-12-04T08:56:59.7425697Z * [new tag] ciflow/inductor/160685 -> ciflow/inductor/160685 2025-12-04T08:56:59.7426599Z * [new tag] ciflow/inductor/160686 -> ciflow/inductor/160686 2025-12-04T08:56:59.7427689Z * [new tag] ciflow/inductor/160687 -> ciflow/inductor/160687 2025-12-04T08:56:59.7428701Z * [new tag] ciflow/inductor/160688 -> ciflow/inductor/160688 2025-12-04T08:56:59.7430116Z * [new tag] ciflow/inductor/160706 -> ciflow/inductor/160706 2025-12-04T08:56:59.7431534Z * [new tag] ciflow/inductor/160729 -> ciflow/inductor/160729 2025-12-04T08:56:59.7432935Z * [new tag] ciflow/inductor/161938 -> ciflow/inductor/161938 2025-12-04T08:56:59.7433953Z * [new tag] ciflow/inductor/161939 -> ciflow/inductor/161939 2025-12-04T08:56:59.7435090Z * [new tag] ciflow/inductor/161940 -> ciflow/inductor/161940 2025-12-04T08:56:59.7436083Z * [new tag] ciflow/inductor/162052 -> ciflow/inductor/162052 2025-12-04T08:56:59.7437378Z * [new tag] ciflow/inductor/162275 -> ciflow/inductor/162275 2025-12-04T08:56:59.7438306Z * [new tag] ciflow/inductor/162795 -> ciflow/inductor/162795 2025-12-04T08:56:59.7439747Z * [new tag] ciflow/inductor/163245 -> ciflow/inductor/163245 2025-12-04T08:56:59.7440669Z * [new tag] ciflow/inductor/163335 -> ciflow/inductor/163335 2025-12-04T08:56:59.7441960Z * [new tag] ciflow/inductor/163503 -> ciflow/inductor/163503 2025-12-04T08:56:59.7442915Z * [new tag] ciflow/inductor/163942 -> ciflow/inductor/163942 2025-12-04T08:56:59.7444275Z * [new tag] ciflow/inductor/165270 -> ciflow/inductor/165270 2025-12-04T08:56:59.7445230Z * [new tag] ciflow/inductor/165274 -> ciflow/inductor/165274 2025-12-04T08:56:59.7446435Z * [new tag] ciflow/inductor/165322 -> ciflow/inductor/165322 2025-12-04T08:56:59.7447424Z * [new tag] ciflow/inductor/165597 -> ciflow/inductor/165597 2025-12-04T08:56:59.7448830Z * [new tag] ciflow/inductor/166063 -> ciflow/inductor/166063 2025-12-04T08:56:59.7449646Z * [new tag] ciflow/inductor/166075 -> ciflow/inductor/166075 2025-12-04T08:56:59.7450812Z * [new tag] ciflow/inductor/166165 -> ciflow/inductor/166165 2025-12-04T08:56:59.7452380Z * [new tag] ciflow/inductor/166254 -> ciflow/inductor/166254 2025-12-04T08:56:59.7453326Z * [new tag] ciflow/inductor/166483 -> ciflow/inductor/166483 2025-12-04T08:56:59.7454441Z * [new tag] ciflow/inductor/166494 -> ciflow/inductor/166494 2025-12-04T08:56:59.7455717Z * [new tag] ciflow/inductor/166545 -> ciflow/inductor/166545 2025-12-04T08:56:59.7456613Z * [new tag] ciflow/inductor/166788 -> ciflow/inductor/166788 2025-12-04T08:56:59.7458044Z * [new tag] ciflow/inductor/166846 -> ciflow/inductor/166846 2025-12-04T08:56:59.7459014Z * [new tag] ciflow/inductor/167300 -> ciflow/inductor/167300 2025-12-04T08:56:59.7460193Z * [new tag] ciflow/inductor/167407 -> ciflow/inductor/167407 2025-12-04T08:56:59.7461444Z * [new tag] ciflow/inductor/167536 -> ciflow/inductor/167536 2025-12-04T08:56:59.7462417Z * [new tag] ciflow/inductor/167552 -> ciflow/inductor/167552 2025-12-04T08:56:59.7463868Z * [new tag] ciflow/inductor/167555 -> ciflow/inductor/167555 2025-12-04T08:56:59.7464921Z * [new tag] ciflow/inductor/167583 -> ciflow/inductor/167583 2025-12-04T08:56:59.7466074Z * [new tag] ciflow/inductor/167599 -> ciflow/inductor/167599 2025-12-04T08:56:59.7467126Z * [new tag] ciflow/inductor/167647 -> ciflow/inductor/167647 2025-12-04T08:56:59.7468268Z * [new tag] ciflow/inductor/167677 -> ciflow/inductor/167677 2025-12-04T08:56:59.7469380Z * [new tag] ciflow/inductor/167680 -> ciflow/inductor/167680 2025-12-04T08:56:59.7470388Z * [new tag] ciflow/inductor/167695 -> ciflow/inductor/167695 2025-12-04T08:56:59.7471640Z * [new tag] ciflow/inductor/167742 -> ciflow/inductor/167742 2025-12-04T08:56:59.7472610Z * [new tag] ciflow/inductor/167768 -> ciflow/inductor/167768 2025-12-04T08:56:59.7474090Z * [new tag] ciflow/inductor/167773 -> ciflow/inductor/167773 2025-12-04T08:56:59.7475012Z * [new tag] ciflow/inductor/167781 -> ciflow/inductor/167781 2025-12-04T08:56:59.7476308Z * [new tag] ciflow/inductor/167880 -> ciflow/inductor/167880 2025-12-04T08:56:59.7477271Z * [new tag] ciflow/inductor/167887 -> ciflow/inductor/167887 2025-12-04T08:56:59.7478534Z * [new tag] ciflow/inductor/167972 -> ciflow/inductor/167972 2025-12-04T08:56:59.7479464Z * [new tag] ciflow/inductor/167989 -> ciflow/inductor/167989 2025-12-04T08:56:59.7480661Z * [new tag] ciflow/inductor/168002 -> ciflow/inductor/168002 2025-12-04T08:56:59.7481668Z * [new tag] ciflow/inductor/168050 -> ciflow/inductor/168050 2025-12-04T08:56:59.7482950Z * [new tag] ciflow/inductor/168051 -> ciflow/inductor/168051 2025-12-04T08:56:59.7483930Z * [new tag] ciflow/inductor/168052 -> ciflow/inductor/168052 2025-12-04T08:56:59.7485078Z * [new tag] ciflow/inductor/168073 -> ciflow/inductor/168073 2025-12-04T08:56:59.7486108Z * [new tag] ciflow/inductor/168096 -> ciflow/inductor/168096 2025-12-04T08:56:59.7487507Z * [new tag] ciflow/inductor/168114 -> ciflow/inductor/168114 2025-12-04T08:56:59.7488316Z * [new tag] ciflow/inductor/168115 -> ciflow/inductor/168115 2025-12-04T08:56:59.7489466Z * [new tag] ciflow/inductor/168127 -> ciflow/inductor/168127 2025-12-04T08:56:59.7490618Z * [new tag] ciflow/inductor/168129 -> ciflow/inductor/168129 2025-12-04T08:56:59.7491648Z * [new tag] ciflow/inductor/168157 -> ciflow/inductor/168157 2025-12-04T08:56:59.7492784Z * [new tag] ciflow/inductor/168175 -> ciflow/inductor/168175 2025-12-04T08:56:59.7494360Z * [new tag] ciflow/inductor/168185 -> ciflow/inductor/168185 2025-12-04T08:56:59.7495123Z * [new tag] ciflow/inductor/168195 -> ciflow/inductor/168195 2025-12-04T08:56:59.7496297Z * [new tag] ciflow/inductor/168209 -> ciflow/inductor/168209 2025-12-04T08:56:59.7497300Z * [new tag] ciflow/inductor/168266 -> ciflow/inductor/168266 2025-12-04T08:56:59.7498993Z * [new tag] ciflow/inductor/168316 -> ciflow/inductor/168316 2025-12-04T08:56:59.7500242Z * [new tag] ciflow/inductor/168326 -> ciflow/inductor/168326 2025-12-04T08:56:59.7501247Z * [new tag] ciflow/inductor/168368 -> ciflow/inductor/168368 2025-12-04T08:56:59.7502383Z * [new tag] ciflow/inductor/168894 -> ciflow/inductor/168894 2025-12-04T08:56:59.7503581Z * [new tag] ciflow/inductor/168934 -> ciflow/inductor/168934 2025-12-04T08:56:59.7504540Z * [new tag] ciflow/inductor/168939 -> ciflow/inductor/168939 2025-12-04T08:56:59.7505829Z * [new tag] ciflow/inductor/168946 -> ciflow/inductor/168946 2025-12-04T08:56:59.7506759Z * [new tag] ciflow/inductor/168950 -> ciflow/inductor/168950 2025-12-04T08:56:59.7508001Z * [new tag] ciflow/inductor/168951 -> ciflow/inductor/168951 2025-12-04T08:56:59.7508969Z * [new tag] ciflow/inductor/168952 -> ciflow/inductor/168952 2025-12-04T08:56:59.7510251Z * [new tag] ciflow/inductor/168955 -> ciflow/inductor/168955 2025-12-04T08:56:59.7511131Z * [new tag] ciflow/inductor/168971 -> ciflow/inductor/168971 2025-12-04T08:56:59.7512367Z * [new tag] ciflow/inductor/168979 -> ciflow/inductor/168979 2025-12-04T08:56:59.7513370Z * [new tag] ciflow/inductor/168980 -> ciflow/inductor/168980 2025-12-04T08:56:59.7514761Z * [new tag] ciflow/inductor/168983 -> ciflow/inductor/168983 2025-12-04T08:56:59.7515701Z * [new tag] ciflow/inductor/169006 -> ciflow/inductor/169006 2025-12-04T08:56:59.7516943Z * [new tag] ciflow/inductor/169023 -> ciflow/inductor/169023 2025-12-04T08:56:59.7517931Z * [new tag] ciflow/inductor/169024 -> ciflow/inductor/169024 2025-12-04T08:56:59.7519172Z * [new tag] ciflow/inductor/169025 -> ciflow/inductor/169025 2025-12-04T08:56:59.7520132Z * [new tag] ciflow/inductor/169066 -> ciflow/inductor/169066 2025-12-04T08:56:59.7521375Z * [new tag] ciflow/inductor/169091 -> ciflow/inductor/169091 2025-12-04T08:56:59.7522368Z * [new tag] ciflow/inductor/169102 -> ciflow/inductor/169102 2025-12-04T08:56:59.7523549Z * [new tag] ciflow/inductor/169103 -> ciflow/inductor/169103 2025-12-04T08:56:59.7524556Z * [new tag] ciflow/inductor/169121 -> ciflow/inductor/169121 2025-12-04T08:56:59.7525695Z * [new tag] ciflow/inductor/169134 -> ciflow/inductor/169134 2025-12-04T08:56:59.7526758Z * [new tag] ciflow/inductor/169135 -> ciflow/inductor/169135 2025-12-04T08:56:59.7527906Z * [new tag] ciflow/inductor/169141 -> ciflow/inductor/169141 2025-12-04T08:56:59.7528906Z * [new tag] ciflow/inductor/169151 -> ciflow/inductor/169151 2025-12-04T08:56:59.7530479Z * [new tag] ciflow/inductor/169161 -> ciflow/inductor/169161 2025-12-04T08:56:59.7531651Z * [new tag] ciflow/inductor/169167 -> ciflow/inductor/169167 2025-12-04T08:56:59.7533066Z * [new tag] ciflow/inductor/169177 -> ciflow/inductor/169177 2025-12-04T08:56:59.7534094Z * [new tag] ciflow/inductor/169185 -> ciflow/inductor/169185 2025-12-04T08:56:59.7535260Z * [new tag] ciflow/inductor/169196 -> ciflow/inductor/169196 2025-12-04T08:56:59.7536448Z * [new tag] ciflow/inductor/169200 -> ciflow/inductor/169200 2025-12-04T08:56:59.7537443Z * [new tag] ciflow/inductor/169204 -> ciflow/inductor/169204 2025-12-04T08:56:59.7538595Z * [new tag] ciflow/inductor/169216 -> ciflow/inductor/169216 2025-12-04T08:56:59.7539725Z * [new tag] ciflow/inductor/169219 -> ciflow/inductor/169219 2025-12-04T08:56:59.7540764Z * [new tag] ciflow/inductor/169220 -> ciflow/inductor/169220 2025-12-04T08:56:59.7542159Z * [new tag] ciflow/inductor/169230 -> ciflow/inductor/169230 2025-12-04T08:56:59.7543067Z * [new tag] ciflow/inductor/169242 -> ciflow/inductor/169242 2025-12-04T08:56:59.7544357Z * [new tag] ciflow/inductor/169245 -> ciflow/inductor/169245 2025-12-04T08:56:59.7545540Z * [new tag] ciflow/inductor/169260 -> ciflow/inductor/169260 2025-12-04T08:56:59.7546568Z * [new tag] ciflow/inductor/169282 -> ciflow/inductor/169282 2025-12-04T08:56:59.7547726Z * [new tag] ciflow/inductor/169286 -> ciflow/inductor/169286 2025-12-04T08:56:59.7548812Z * [new tag] ciflow/inductor/169299 -> ciflow/inductor/169299 2025-12-04T08:56:59.7550239Z * [new tag] ciflow/inductor/169304 -> ciflow/inductor/169304 2025-12-04T08:56:59.7551689Z * [new tag] ciflow/inductor/169305 -> ciflow/inductor/169305 2025-12-04T08:56:59.7552695Z * [new tag] ciflow/inductor/169308 -> ciflow/inductor/169308 2025-12-04T08:56:59.7553857Z * [new tag] ciflow/inductor/169319 -> ciflow/inductor/169319 2025-12-04T08:56:59.7555059Z * [new tag] ciflow/inductor/169326 -> ciflow/inductor/169326 2025-12-04T08:56:59.7556017Z * [new tag] ciflow/inductor/169332 -> ciflow/inductor/169332 2025-12-04T08:56:59.7557180Z * [new tag] ciflow/inductor/169333 -> ciflow/inductor/169333 2025-12-04T08:56:59.7558614Z * [new tag] ciflow/inductor/169336 -> ciflow/inductor/169336 2025-12-04T08:56:59.7559535Z * [new tag] ciflow/inductor/169340 -> ciflow/inductor/169340 2025-12-04T08:56:59.7560775Z * [new tag] ciflow/inductor/169341 -> ciflow/inductor/169341 2025-12-04T08:56:59.7561903Z * [new tag] ciflow/inductor/169343 -> ciflow/inductor/169343 2025-12-04T08:56:59.7563277Z * [new tag] ciflow/inductor/169346 -> ciflow/inductor/169346 2025-12-04T08:56:59.7564463Z * [new tag] ciflow/inductor/169348 -> ciflow/inductor/169348 2025-12-04T08:56:59.7565612Z * [new tag] ciflow/inductor/169350 -> ciflow/inductor/169350 2025-12-04T08:56:59.7566708Z * [new tag] ciflow/inductor/169355 -> ciflow/inductor/169355 2025-12-04T08:56:59.7567998Z * [new tag] ciflow/inductor/169370 -> ciflow/inductor/169370 2025-12-04T08:56:59.7569459Z * [new tag] ciflow/inductor/169375 -> ciflow/inductor/169375 2025-12-04T08:56:59.7570372Z * [new tag] ciflow/inductor/169389 -> ciflow/inductor/169389 2025-12-04T08:56:59.7571551Z * [new tag] ciflow/inductor/169391 -> ciflow/inductor/169391 2025-12-04T08:56:59.7573258Z * [new tag] ciflow/inductor/169393 -> ciflow/inductor/169393 2025-12-04T08:56:59.7574316Z * [new tag] ciflow/inductor/169399 -> ciflow/inductor/169399 2025-12-04T08:56:59.7575638Z * [new tag] ciflow/inductor/169400 -> ciflow/inductor/169400 2025-12-04T08:56:59.7576605Z * [new tag] ciflow/inductor/169415 -> ciflow/inductor/169415 2025-12-04T08:56:59.7577904Z * [new tag] ciflow/inductor/169417 -> ciflow/inductor/169417 2025-12-04T08:56:59.7579118Z * [new tag] ciflow/inductor/169418 -> ciflow/inductor/169418 2025-12-04T08:56:59.7580238Z * [new tag] ciflow/inductor/169430 -> ciflow/inductor/169430 2025-12-04T08:56:59.7581413Z * [new tag] ciflow/inductor/169432 -> ciflow/inductor/169432 2025-12-04T08:56:59.7582457Z * [new tag] ciflow/inductor/169436 -> ciflow/inductor/169436 2025-12-04T08:56:59.7583631Z * [new tag] ciflow/inductor/169437 -> ciflow/inductor/169437 2025-12-04T08:56:59.7596487Z * [new tag] ciflow/inductor/169438 -> ciflow/inductor/169438 2025-12-04T08:56:59.7596662Z * [new tag] ciflow/inductor/169441 -> ciflow/inductor/169441 2025-12-04T08:56:59.7596803Z * [new tag] ciflow/inductor/169446 -> ciflow/inductor/169446 2025-12-04T08:56:59.7596929Z * [new tag] ciflow/inductor/169447 -> ciflow/inductor/169447 2025-12-04T08:56:59.7597062Z * [new tag] ciflow/inductor/169452 -> ciflow/inductor/169452 2025-12-04T08:56:59.7597194Z * [new tag] ciflow/inductor/169455 -> ciflow/inductor/169455 2025-12-04T08:56:59.7597319Z * [new tag] ciflow/inductor/169459 -> ciflow/inductor/169459 2025-12-04T08:56:59.7597446Z * [new tag] ciflow/inductor/169463 -> ciflow/inductor/169463 2025-12-04T08:56:59.7597568Z * [new tag] ciflow/inductor/169476 -> ciflow/inductor/169476 2025-12-04T08:56:59.7597686Z * [new tag] ciflow/inductor/169485 -> ciflow/inductor/169485 2025-12-04T08:56:59.7597813Z * [new tag] ciflow/inductor/169493 -> ciflow/inductor/169493 2025-12-04T08:56:59.7597934Z * [new tag] ciflow/inductor/169496 -> ciflow/inductor/169496 2025-12-04T08:56:59.7598631Z * [new tag] ciflow/inductor/169497 -> ciflow/inductor/169497 2025-12-04T08:56:59.7599885Z * [new tag] ciflow/inductor/169503 -> ciflow/inductor/169503 2025-12-04T08:56:59.7601022Z * [new tag] ciflow/inductor/169504 -> ciflow/inductor/169504 2025-12-04T08:56:59.7602617Z * [new tag] ciflow/inductor/169505 -> ciflow/inductor/169505 2025-12-04T08:56:59.7603918Z * [new tag] ciflow/inductor/169508 -> ciflow/inductor/169508 2025-12-04T08:56:59.7604885Z * [new tag] ciflow/inductor/169509 -> ciflow/inductor/169509 2025-12-04T08:56:59.7606246Z * [new tag] ciflow/inductor/169513 -> ciflow/inductor/169513 2025-12-04T08:56:59.7607261Z * [new tag] ciflow/inductor/169514 -> ciflow/inductor/169514 2025-12-04T08:56:59.7608526Z * [new tag] ciflow/inductor/169515 -> ciflow/inductor/169515 2025-12-04T08:56:59.7609491Z * [new tag] ciflow/inductor/169517 -> ciflow/inductor/169517 2025-12-04T08:56:59.7610785Z * [new tag] ciflow/inductor/169519 -> ciflow/inductor/169519 2025-12-04T08:56:59.7611781Z * [new tag] ciflow/inductor/169520 -> ciflow/inductor/169520 2025-12-04T08:56:59.7613186Z * [new tag] ciflow/inductor/169521 -> ciflow/inductor/169521 2025-12-04T08:56:59.7614116Z * [new tag] ciflow/inductor/169524 -> ciflow/inductor/169524 2025-12-04T08:56:59.7615410Z * [new tag] ciflow/inductor/169527 -> ciflow/inductor/169527 2025-12-04T08:56:59.7616356Z * [new tag] ciflow/inductor/169528 -> ciflow/inductor/169528 2025-12-04T08:56:59.7617700Z * [new tag] ciflow/inductor/169532 -> ciflow/inductor/169532 2025-12-04T08:56:59.7618699Z * [new tag] ciflow/inductor/169535 -> ciflow/inductor/169535 2025-12-04T08:56:59.7619861Z * [new tag] ciflow/inductor/169536 -> ciflow/inductor/169536 2025-12-04T08:56:59.7621119Z * [new tag] ciflow/inductor/169547 -> ciflow/inductor/169547 2025-12-04T08:56:59.7622218Z * [new tag] ciflow/inductor/169548 -> ciflow/inductor/169548 2025-12-04T08:56:59.7623257Z * [new tag] ciflow/inductor/169549 -> ciflow/inductor/169549 2025-12-04T08:56:59.7624445Z * [new tag] ciflow/inductor/169551 -> ciflow/inductor/169551 2025-12-04T08:56:59.7625461Z * [new tag] ciflow/inductor/169552 -> ciflow/inductor/169552 2025-12-04T08:56:59.7626712Z * [new tag] ciflow/inductor/169553 -> ciflow/inductor/169553 2025-12-04T08:56:59.7628064Z * [new tag] ciflow/inductor/3b9a386 -> ciflow/inductor/3b9a386 2025-12-04T08:56:59.7629331Z * [new tag] ciflow/inductor/3d4b92b -> ciflow/inductor/3d4b92b 2025-12-04T08:56:59.7630582Z * [new tag] ciflow/inductor/d224ac7 -> ciflow/inductor/d224ac7 2025-12-04T08:56:59.7631878Z * [new tag] ciflow/linux-aarch64/157994 -> ciflow/linux-aarch64/157994 2025-12-04T08:56:59.7632784Z * [new tag] ciflow/linux-aarch64/166075 -> ciflow/linux-aarch64/166075 2025-12-04T08:56:59.7633969Z * [new tag] ciflow/linux-aarch64/166876 -> ciflow/linux-aarch64/166876 2025-12-04T08:56:59.7634917Z * [new tag] ciflow/linux-aarch64/167981 -> ciflow/linux-aarch64/167981 2025-12-04T08:56:59.7636331Z * [new tag] ciflow/mps/166254 -> ciflow/mps/166254 2025-12-04T08:56:59.7637248Z * [new tag] ciflow/mps/169017 -> ciflow/mps/169017 2025-12-04T08:56:59.7638662Z * [new tag] ciflow/mps/169372 -> ciflow/mps/169372 2025-12-04T08:56:59.7639461Z * [new tag] ciflow/mps/169478 -> ciflow/mps/169478 2025-12-04T08:56:59.7640970Z * [new tag] ciflow/op-benchmark/157994 -> ciflow/op-benchmark/157994 2025-12-04T08:56:59.7641838Z * [new tag] ciflow/op-benchmark/166075 -> ciflow/op-benchmark/166075 2025-12-04T08:56:59.7642994Z * [new tag] ciflow/op-benchmark/169544 -> ciflow/op-benchmark/169544 2025-12-04T08:56:59.7644230Z * [new tag] ciflow/periodic-rocm-mi200/165997 -> ciflow/periodic-rocm-mi200/165997 2025-12-04T08:56:59.7645388Z * [new tag] ciflow/periodic-rocm-mi200/166517 -> ciflow/periodic-rocm-mi200/166517 2025-12-04T08:56:59.7646446Z * [new tag] ciflow/periodic-rocm-mi200/169063 -> ciflow/periodic-rocm-mi200/169063 2025-12-04T08:56:59.7647567Z * [new tag] ciflow/periodic-rocm-mi200/169425 -> ciflow/periodic-rocm-mi200/169425 2025-12-04T08:56:59.7648750Z * [new tag] ciflow/periodic-rocm-mi300/166517 -> ciflow/periodic-rocm-mi300/166517 2025-12-04T08:56:59.7649883Z * [new tag] ciflow/periodic-rocm-mi300/169063 -> ciflow/periodic-rocm-mi300/169063 2025-12-04T08:56:59.7650905Z * [new tag] ciflow/periodic-rocm-mi300/169425 -> ciflow/periodic-rocm-mi300/169425 2025-12-04T08:56:59.7652362Z * [new tag] ciflow/periodic/054a2fd -> ciflow/periodic/054a2fd 2025-12-04T08:56:59.7653470Z * [new tag] ciflow/periodic/167207 -> ciflow/periodic/167207 2025-12-04T08:56:59.7654584Z * [new tag] ciflow/periodic/167978 -> ciflow/periodic/167978 2025-12-04T08:56:59.7655728Z * [new tag] ciflow/periodic/168096 -> ciflow/periodic/168096 2025-12-04T08:56:59.7656673Z * [new tag] ciflow/periodic/169286 -> ciflow/periodic/169286 2025-12-04T08:56:59.7658117Z * [new tag] ciflow/periodic/2a6d37d -> ciflow/periodic/2a6d37d 2025-12-04T08:56:59.7659267Z * [new tag] ciflow/periodic/317eeb8 -> ciflow/periodic/317eeb8 2025-12-04T08:56:59.7660396Z * [new tag] ciflow/periodic/3c32 -> ciflow/periodic/3c32 2025-12-04T08:56:59.7661717Z * [new tag] ciflow/periodic/3e98831 -> ciflow/periodic/3e98831 2025-12-04T08:56:59.7665148Z * [new tag] ciflow/periodic/7c648509a7470ace9fb2bae960dd4790f7e943e9 -> ciflow/periodic/7c648509a7470ace9fb2bae960dd4790f7e943e9 2025-12-04T08:56:59.7666276Z * [new tag] ciflow/periodic/94512-point -> ciflow/periodic/94512-point 2025-12-04T08:56:59.7667926Z * [new tag] ciflow/periodic/csl/test87519 -> ciflow/periodic/csl/test87519 2025-12-04T08:56:59.7669257Z * [new tag] ciflow/periodic/csltest88275 -> ciflow/periodic/csltest88275 2025-12-04T08:56:59.7670296Z * [new tag] ciflow/periodic/csltest88761 -> ciflow/periodic/csltest88761 2025-12-04T08:56:59.7671742Z * [new tag] ciflow/periodic/release_1.12 -> ciflow/periodic/release_1.12 2025-12-04T08:56:59.7673144Z * [new tag] ciflow/periodic/release_1.12.0 -> ciflow/periodic/release_1.12.0 2025-12-04T08:56:59.7674481Z * [new tag] ciflow/periodic/sha-ec5b83 -> ciflow/periodic/sha-ec5b83 2025-12-04T08:56:59.7675830Z * [new tag] ciflow/pull/167207 -> ciflow/pull/167207 2025-12-04T08:56:59.7677316Z * [new tag] ciflow/quantization-periodic/169207 -> ciflow/quantization-periodic/169207 2025-12-04T08:56:59.7678382Z * [new tag] ciflow/rocm-mi200/165545 -> ciflow/rocm-mi200/165545 2025-12-04T08:56:59.7679407Z * [new tag] ciflow/rocm-mi200/165997 -> ciflow/rocm-mi200/165997 2025-12-04T08:56:59.7680445Z * [new tag] ciflow/rocm-mi200/168096 -> ciflow/rocm-mi200/168096 2025-12-04T08:56:59.7681746Z * [new tag] ciflow/rocm-mi200/168275 -> ciflow/rocm-mi200/168275 2025-12-04T08:56:59.7683206Z * [new tag] ciflow/rocm-mi200/169063 -> ciflow/rocm-mi200/169063 2025-12-04T08:56:59.7684284Z * [new tag] ciflow/rocm-mi200/169356 -> ciflow/rocm-mi200/169356 2025-12-04T08:56:59.7685301Z * [new tag] ciflow/rocm-mi200/169425 -> ciflow/rocm-mi200/169425 2025-12-04T08:56:59.7686686Z * [new tag] ciflow/rocm-mi300/165545 -> ciflow/rocm-mi300/165545 2025-12-04T08:56:59.7687858Z * [new tag] ciflow/rocm-mi300/167157 -> ciflow/rocm-mi300/167157 2025-12-04T08:56:59.7688770Z * [new tag] ciflow/rocm-mi300/168096 -> ciflow/rocm-mi300/168096 2025-12-04T08:56:59.7689722Z * [new tag] ciflow/rocm-mi300/169063 -> ciflow/rocm-mi300/169063 2025-12-04T08:56:59.7690875Z * [new tag] ciflow/rocm-mi300/169425 -> ciflow/rocm-mi300/169425 2025-12-04T08:56:59.7692136Z * [new tag] ciflow/rocm-mi355/167157 -> ciflow/rocm-mi355/167157 2025-12-04T08:56:59.7693469Z * [new tag] ciflow/rocm-mi355/168275 -> ciflow/rocm-mi355/168275 2025-12-04T08:56:59.7694366Z * [new tag] ciflow/rocm-mi355/169425 -> ciflow/rocm-mi355/169425 2025-12-04T08:56:59.7695768Z * [new tag] ciflow/rocm-navi31/168275 -> ciflow/rocm-navi31/168275 2025-12-04T08:56:59.7696714Z * [new tag] ciflow/rocm-navi31/169425 -> ciflow/rocm-navi31/169425 2025-12-04T08:56:59.7698027Z * [new tag] ciflow/rocm/115316 -> ciflow/rocm/115316 2025-12-04T08:56:59.7698966Z * [new tag] ciflow/rocm/148492 -> ciflow/rocm/148492 2025-12-04T08:56:59.7699966Z * [new tag] ciflow/rocm/160685 -> ciflow/rocm/160685 2025-12-04T08:56:59.7700962Z * [new tag] ciflow/rocm/161607 -> ciflow/rocm/161607 2025-12-04T08:56:59.7701960Z * [new tag] ciflow/rocm/162052 -> ciflow/rocm/162052 2025-12-04T08:56:59.7703146Z * [new tag] ciflow/rocm/165997 -> ciflow/rocm/165997 2025-12-04T08:56:59.7704142Z * [new tag] ciflow/rocm/166165 -> ciflow/rocm/166165 2025-12-04T08:56:59.7705227Z * [new tag] ciflow/rocm/166517 -> ciflow/rocm/166517 2025-12-04T08:56:59.7706171Z * [new tag] ciflow/rocm/167207 -> ciflow/rocm/167207 2025-12-04T08:56:59.7707627Z * [new tag] ciflow/rocm/167536 -> ciflow/rocm/167536 2025-12-04T08:56:59.7708281Z * [new tag] ciflow/rocm/167781 -> ciflow/rocm/167781 2025-12-04T08:56:59.7709720Z * [new tag] ciflow/rocm/167989 -> ciflow/rocm/167989 2025-12-04T08:56:59.7711132Z * [new tag] ciflow/rocm/168073 -> ciflow/rocm/168073 2025-12-04T08:56:59.7712434Z * [new tag] ciflow/rocm/168195 -> ciflow/rocm/168195 2025-12-04T08:56:59.7713419Z * [new tag] ciflow/rocm/168939 -> ciflow/rocm/168939 2025-12-04T08:56:59.7714609Z * [new tag] ciflow/rocm/168971 -> ciflow/rocm/168971 2025-12-04T08:56:59.7715606Z * [new tag] ciflow/rocm/169024 -> ciflow/rocm/169024 2025-12-04T08:56:59.7716804Z * [new tag] ciflow/rocm/169200 -> ciflow/rocm/169200 2025-12-04T08:56:59.7717900Z * [new tag] ciflow/rocm/169216 -> ciflow/rocm/169216 2025-12-04T08:56:59.7719077Z * [new tag] ciflow/rocm/169312 -> ciflow/rocm/169312 2025-12-04T08:56:59.7720035Z * [new tag] ciflow/rocm/169380 -> ciflow/rocm/169380 2025-12-04T08:56:59.7721231Z * [new tag] ciflow/rocm/169427 -> ciflow/rocm/169427 2025-12-04T08:56:59.7722221Z * [new tag] ciflow/rocm/169455 -> ciflow/rocm/169455 2025-12-04T08:56:59.7723434Z * [new tag] ciflow/rocm/169470 -> ciflow/rocm/169470 2025-12-04T08:56:59.7724453Z * [new tag] ciflow/rocm/169471 -> ciflow/rocm/169471 2025-12-04T08:56:59.7725725Z * [new tag] ciflow/rocm/169472 -> ciflow/rocm/169472 2025-12-04T08:56:59.7726701Z * [new tag] ciflow/rocm/169514 -> ciflow/rocm/169514 2025-12-04T08:56:59.7728186Z * [new tag] ciflow/slow/01c7106 -> ciflow/slow/01c7106 2025-12-04T08:56:59.7729234Z * [new tag] ciflow/slow/0577043 -> ciflow/slow/0577043 2025-12-04T08:56:59.7730855Z * [new tag] ciflow/slow/0d5b74da0cab798fbfdb9caa53fad816999c8386-sdym -> ciflow/slow/0d5b74da0cab798fbfdb9caa53fad816999c8386-sdym 2025-12-04T08:56:59.7731757Z * [new tag] ciflow/slow/0e81104 -> ciflow/slow/0e81104 2025-12-04T08:56:59.7732964Z * [new tag] ciflow/slow/167207 -> ciflow/slow/167207 2025-12-04T08:56:59.7733948Z * [new tag] ciflow/slow/168050 -> ciflow/slow/168050 2025-12-04T08:56:59.7735257Z * [new tag] ciflow/slow/1732077 -> ciflow/slow/1732077 2025-12-04T08:56:59.7736477Z * [new tag] ciflow/slow/187eb7c -> ciflow/slow/187eb7c 2025-12-04T08:56:59.7738015Z * [new tag] ciflow/slow/1faef89 -> ciflow/slow/1faef89 2025-12-04T08:56:59.7739612Z * [new tag] ciflow/slow/3920ec1 -> ciflow/slow/3920ec1 2025-12-04T08:56:59.7741107Z * [new tag] ciflow/slow/3b7c6b2 -> ciflow/slow/3b7c6b2 2025-12-04T08:56:59.7742415Z * [new tag] ciflow/slow/59a3759 -> ciflow/slow/59a3759 2025-12-04T08:56:59.7743704Z * [new tag] ciflow/slow/70ef0bb -> ciflow/slow/70ef0bb 2025-12-04T08:56:59.7744905Z * [new tag] ciflow/slow/788ff06 -> ciflow/slow/788ff06 2025-12-04T08:56:59.7746519Z * [new tag] ciflow/slow/8751002215790a3a88750faa8f4366933e296693-sdym -> ciflow/slow/8751002215790a3a88750faa8f4366933e296693-sdym 2025-12-04T08:56:59.7747450Z * [new tag] ciflow/slow/9d85864 -> ciflow/slow/9d85864 2025-12-04T08:56:59.7748777Z * [new tag] ciflow/slow/9ffad5b -> ciflow/slow/9ffad5b 2025-12-04T08:56:59.7749985Z * [new tag] ciflow/slow/a206e8b -> ciflow/slow/a206e8b 2025-12-04T08:56:59.7751328Z * [new tag] ciflow/slow/a837609 -> ciflow/slow/a837609 2025-12-04T08:56:59.7752444Z * [new tag] ciflow/slow/af841f3 -> ciflow/slow/af841f3 2025-12-04T08:56:59.7754207Z * [new tag] ciflow/slow/da3aba1e46157c4df504b067477cdf2b3c96b194-sdym -> ciflow/slow/da3aba1e46157c4df504b067477cdf2b3c96b194-sdym 2025-12-04T08:56:59.7755175Z * [new tag] ciflow/torchbench/168175 -> ciflow/torchbench/168175 2025-12-04T08:56:59.7756515Z * [new tag] ciflow/trunk/148492 -> ciflow/trunk/148492 2025-12-04T08:56:59.7757698Z * [new tag] ciflow/trunk/157149 -> ciflow/trunk/157149 2025-12-04T08:56:59.7758619Z * [new tag] ciflow/trunk/157994 -> ciflow/trunk/157994 2025-12-04T08:56:59.7759621Z * [new tag] ciflow/trunk/159718 -> ciflow/trunk/159718 2025-12-04T08:56:59.7760690Z * [new tag] ciflow/trunk/160685 -> ciflow/trunk/160685 2025-12-04T08:56:59.7761956Z * [new tag] ciflow/trunk/160729 -> ciflow/trunk/160729 2025-12-04T08:56:59.7762972Z * [new tag] ciflow/trunk/162275 -> ciflow/trunk/162275 2025-12-04T08:56:59.7764080Z * [new tag] ciflow/trunk/162795 -> ciflow/trunk/162795 2025-12-04T08:56:59.7765045Z * [new tag] ciflow/trunk/163245 -> ciflow/trunk/163245 2025-12-04T08:56:59.7766224Z * [new tag] ciflow/trunk/163942 -> ciflow/trunk/163942 2025-12-04T08:56:59.7767133Z * [new tag] ciflow/trunk/165274 -> ciflow/trunk/165274 2025-12-04T08:56:59.7768666Z * [new tag] ciflow/trunk/165483 -> ciflow/trunk/165483 2025-12-04T08:56:59.7770073Z * [new tag] ciflow/trunk/165728 -> ciflow/trunk/165728 2025-12-04T08:56:59.7771414Z * [new tag] ciflow/trunk/165922 -> ciflow/trunk/165922 2025-12-04T08:56:59.7772777Z * [new tag] ciflow/trunk/166075 -> ciflow/trunk/166075 2025-12-04T08:56:59.7773819Z * [new tag] ciflow/trunk/166165 -> ciflow/trunk/166165 2025-12-04T08:56:59.7775012Z * [new tag] ciflow/trunk/166829 -> ciflow/trunk/166829 2025-12-04T08:56:59.7776271Z * [new tag] ciflow/trunk/166843 -> ciflow/trunk/166843 2025-12-04T08:56:59.7777408Z * [new tag] ciflow/trunk/166876 -> ciflow/trunk/166876 2025-12-04T08:56:59.7778410Z * [new tag] ciflow/trunk/167207 -> ciflow/trunk/167207 2025-12-04T08:56:59.7779611Z * [new tag] ciflow/trunk/167536 -> ciflow/trunk/167536 2025-12-04T08:56:59.7780776Z * [new tag] ciflow/trunk/167552 -> ciflow/trunk/167552 2025-12-04T08:56:59.7781741Z * [new tag] ciflow/trunk/167555 -> ciflow/trunk/167555 2025-12-04T08:56:59.7782908Z * [new tag] ciflow/trunk/167599 -> ciflow/trunk/167599 2025-12-04T08:56:59.7783879Z * [new tag] ciflow/trunk/167659 -> ciflow/trunk/167659 2025-12-04T08:56:59.7785308Z * [new tag] ciflow/trunk/167672 -> ciflow/trunk/167672 2025-12-04T08:56:59.7786278Z * [new tag] ciflow/trunk/167742 -> ciflow/trunk/167742 2025-12-04T08:56:59.7787501Z * [new tag] ciflow/trunk/167781 -> ciflow/trunk/167781 2025-12-04T08:56:59.7788708Z * [new tag] ciflow/trunk/167837 -> ciflow/trunk/167837 2025-12-04T08:56:59.7789870Z * [new tag] ciflow/trunk/167887 -> ciflow/trunk/167887 2025-12-04T08:56:59.7790863Z * [new tag] ciflow/trunk/167978 -> ciflow/trunk/167978 2025-12-04T08:56:59.7792052Z * [new tag] ciflow/trunk/168050 -> ciflow/trunk/168050 2025-12-04T08:56:59.7793055Z * [new tag] ciflow/trunk/168051 -> ciflow/trunk/168051 2025-12-04T08:56:59.7794453Z * [new tag] ciflow/trunk/168096 -> ciflow/trunk/168096 2025-12-04T08:56:59.7795279Z * [new tag] ciflow/trunk/168127 -> ciflow/trunk/168127 2025-12-04T08:56:59.7796480Z * [new tag] ciflow/trunk/168157 -> ciflow/trunk/168157 2025-12-04T08:56:59.7797422Z * [new tag] ciflow/trunk/168175 -> ciflow/trunk/168175 2025-12-04T08:56:59.7798633Z * [new tag] ciflow/trunk/168209 -> ciflow/trunk/168209 2025-12-04T08:56:59.7799818Z * [new tag] ciflow/trunk/168213 -> ciflow/trunk/168213 2025-12-04T08:56:59.7801042Z * [new tag] ciflow/trunk/168226 -> ciflow/trunk/168226 2025-12-04T08:56:59.7802034Z * [new tag] ciflow/trunk/168262 -> ciflow/trunk/168262 2025-12-04T08:56:59.7803247Z * [new tag] ciflow/trunk/168275 -> ciflow/trunk/168275 2025-12-04T08:56:59.7804431Z * [new tag] ciflow/trunk/168328 -> ciflow/trunk/168328 2025-12-04T08:56:59.7805423Z * [new tag] ciflow/trunk/168368 -> ciflow/trunk/168368 2025-12-04T08:56:59.7806630Z * [new tag] ciflow/trunk/168917 -> ciflow/trunk/168917 2025-12-04T08:56:59.7807788Z * [new tag] ciflow/trunk/168933 -> ciflow/trunk/168933 2025-12-04T08:56:59.7808987Z * [new tag] ciflow/trunk/168941 -> ciflow/trunk/168941 2025-12-04T08:56:59.7809962Z * [new tag] ciflow/trunk/168955 -> ciflow/trunk/168955 2025-12-04T08:56:59.7811163Z * [new tag] ciflow/trunk/168980 -> ciflow/trunk/168980 2025-12-04T08:56:59.7812538Z * [new tag] ciflow/trunk/169004 -> ciflow/trunk/169004 2025-12-04T08:56:59.7813755Z * [new tag] ciflow/trunk/169006 -> ciflow/trunk/169006 2025-12-04T08:56:59.7814739Z * [new tag] ciflow/trunk/169023 -> ciflow/trunk/169023 2025-12-04T08:56:59.7815966Z * [new tag] ciflow/trunk/169025 -> ciflow/trunk/169025 2025-12-04T08:56:59.7816974Z * [new tag] ciflow/trunk/169048 -> ciflow/trunk/169048 2025-12-04T08:56:59.7818180Z * [new tag] ciflow/trunk/169066 -> ciflow/trunk/169066 2025-12-04T08:56:59.7819165Z * [new tag] ciflow/trunk/169091 -> ciflow/trunk/169091 2025-12-04T08:56:59.7820855Z * [new tag] ciflow/trunk/169102 -> ciflow/trunk/169102 2025-12-04T08:56:59.7821822Z * [new tag] ciflow/trunk/169103 -> ciflow/trunk/169103 2025-12-04T08:56:59.7823145Z * [new tag] ciflow/trunk/169125 -> ciflow/trunk/169125 2025-12-04T08:56:59.7824369Z * [new tag] ciflow/trunk/169139 -> ciflow/trunk/169139 2025-12-04T08:56:59.7825584Z * [new tag] ciflow/trunk/169148 -> ciflow/trunk/169148 2025-12-04T08:56:59.7826664Z * [new tag] ciflow/trunk/169151 -> ciflow/trunk/169151 2025-12-04T08:56:59.7827867Z * [new tag] ciflow/trunk/169156 -> ciflow/trunk/169156 2025-12-04T08:56:59.7829104Z * [new tag] ciflow/trunk/169176 -> ciflow/trunk/169176 2025-12-04T08:56:59.7830250Z * [new tag] ciflow/trunk/169204 -> ciflow/trunk/169204 2025-12-04T08:56:59.7831301Z * [new tag] ciflow/trunk/169207 -> ciflow/trunk/169207 2025-12-04T08:56:59.7832469Z * [new tag] ciflow/trunk/169211 -> ciflow/trunk/169211 2025-12-04T08:56:59.7833732Z * [new tag] ciflow/trunk/169229 -> ciflow/trunk/169229 2025-12-04T08:56:59.7835029Z * [new tag] ciflow/trunk/169231 -> ciflow/trunk/169231 2025-12-04T08:56:59.7836116Z * [new tag] ciflow/trunk/169260 -> ciflow/trunk/169260 2025-12-04T08:56:59.7837553Z * [new tag] ciflow/trunk/169271 -> ciflow/trunk/169271 2025-12-04T08:56:59.7838633Z * [new tag] ciflow/trunk/169280 -> ciflow/trunk/169280 2025-12-04T08:56:59.7839666Z * [new tag] ciflow/trunk/169281 -> ciflow/trunk/169281 2025-12-04T08:56:59.7840904Z * [new tag] ciflow/trunk/169286 -> ciflow/trunk/169286 2025-12-04T08:56:59.7842133Z * [new tag] ciflow/trunk/169293 -> ciflow/trunk/169293 2025-12-04T08:56:59.7843111Z * [new tag] ciflow/trunk/169296 -> ciflow/trunk/169296 2025-12-04T08:56:59.7844341Z * [new tag] ciflow/trunk/169304 -> ciflow/trunk/169304 2025-12-04T08:56:59.7845330Z * [new tag] ciflow/trunk/169305 -> ciflow/trunk/169305 2025-12-04T08:56:59.7846552Z * [new tag] ciflow/trunk/169312 -> ciflow/trunk/169312 2025-12-04T08:56:59.7847871Z * [new tag] ciflow/trunk/169328 -> ciflow/trunk/169328 2025-12-04T08:56:59.7848937Z * [new tag] ciflow/trunk/169343 -> ciflow/trunk/169343 2025-12-04T08:56:59.7850135Z * [new tag] ciflow/trunk/169355 -> ciflow/trunk/169355 2025-12-04T08:56:59.7851305Z * [new tag] ciflow/trunk/169370 -> ciflow/trunk/169370 2025-12-04T08:56:59.7852627Z * [new tag] ciflow/trunk/169379 -> ciflow/trunk/169379 2025-12-04T08:56:59.7853840Z * [new tag] ciflow/trunk/169380 -> ciflow/trunk/169380 2025-12-04T08:56:59.7854848Z * [new tag] ciflow/trunk/169385 -> ciflow/trunk/169385 2025-12-04T08:56:59.7856073Z * [new tag] ciflow/trunk/169387 -> ciflow/trunk/169387 2025-12-04T08:56:59.7857324Z * [new tag] ciflow/trunk/169410 -> ciflow/trunk/169410 2025-12-04T08:56:59.7858328Z * [new tag] ciflow/trunk/169412 -> ciflow/trunk/169412 2025-12-04T08:56:59.7859560Z * [new tag] ciflow/trunk/169418 -> ciflow/trunk/169418 2025-12-04T08:56:59.7860577Z * [new tag] ciflow/trunk/169423 -> ciflow/trunk/169423 2025-12-04T08:56:59.7861915Z * [new tag] ciflow/trunk/169427 -> ciflow/trunk/169427 2025-12-04T08:56:59.7863035Z * [new tag] ciflow/trunk/169430 -> ciflow/trunk/169430 2025-12-04T08:56:59.7864209Z * [new tag] ciflow/trunk/169437 -> ciflow/trunk/169437 2025-12-04T08:56:59.7865206Z * [new tag] ciflow/trunk/169442 -> ciflow/trunk/169442 2025-12-04T08:56:59.7866440Z * [new tag] ciflow/trunk/169452 -> ciflow/trunk/169452 2025-12-04T08:56:59.7867446Z * [new tag] ciflow/trunk/169454 -> ciflow/trunk/169454 2025-12-04T08:56:59.7868675Z * [new tag] ciflow/trunk/169459 -> ciflow/trunk/169459 2025-12-04T08:56:59.7869881Z * [new tag] ciflow/trunk/169474 -> ciflow/trunk/169474 2025-12-04T08:56:59.7870889Z * [new tag] ciflow/trunk/169475 -> ciflow/trunk/169475 2025-12-04T08:56:59.7872097Z * [new tag] ciflow/trunk/169476 -> ciflow/trunk/169476 2025-12-04T08:56:59.7873313Z * [new tag] ciflow/trunk/169487 -> ciflow/trunk/169487 2025-12-04T08:56:59.7874448Z * [new tag] ciflow/trunk/169497 -> ciflow/trunk/169497 2025-12-04T08:56:59.7875616Z * [new tag] ciflow/trunk/169503 -> ciflow/trunk/169503 2025-12-04T08:56:59.7876626Z * [new tag] ciflow/trunk/169505 -> ciflow/trunk/169505 2025-12-04T08:56:59.7877854Z * [new tag] ciflow/trunk/169507 -> ciflow/trunk/169507 2025-12-04T08:56:59.7878844Z * [new tag] ciflow/trunk/169514 -> ciflow/trunk/169514 2025-12-04T08:56:59.7880056Z * [new tag] ciflow/trunk/169517 -> ciflow/trunk/169517 2025-12-04T08:56:59.7881308Z * [new tag] ciflow/trunk/169519 -> ciflow/trunk/169519 2025-12-04T08:56:59.7882183Z * [new tag] ciflow/trunk/169528 -> ciflow/trunk/169528 2025-12-04T08:56:59.7883467Z * [new tag] ciflow/trunk/169541 -> ciflow/trunk/169541 2025-12-04T08:56:59.7884722Z * [new tag] ciflow/trunk/169555 -> ciflow/trunk/169555 2025-12-04T08:56:59.7886235Z * [new tag] ciflow/unstable/123 -> ciflow/unstable/123 2025-12-04T08:56:59.7887601Z * [new tag] ciflow/vllm/165270 -> ciflow/vllm/165270 2025-12-04T08:56:59.7888570Z * [new tag] ciflow/vllm/165274 -> ciflow/vllm/165274 2025-12-04T08:56:59.7889762Z * [new tag] ciflow/vllm/166494 -> ciflow/vllm/166494 2025-12-04T08:56:59.7890715Z * [new tag] ciflow/vllm/169219 -> ciflow/vllm/169219 2025-12-04T08:56:59.7891865Z * [new tag] ciflow/vllm/169220 -> ciflow/vllm/169220 2025-12-04T08:56:59.7893270Z * [new tag] ciflow/xpu/157994 -> ciflow/xpu/157994 2025-12-04T08:56:59.7894804Z * [new tag] ciflow/xpu/159718 -> ciflow/xpu/159718 2025-12-04T08:56:59.7895746Z * [new tag] ciflow/xpu/161940 -> ciflow/xpu/161940 2025-12-04T08:56:59.7897050Z * [new tag] ciflow/xpu/163251 -> ciflow/xpu/163251 2025-12-04T08:56:59.7897986Z * [new tag] ciflow/xpu/166829 -> ciflow/xpu/166829 2025-12-04T08:56:59.7899133Z * [new tag] ciflow/xpu/166843 -> ciflow/xpu/166843 2025-12-04T08:56:59.7900082Z * [new tag] ciflow/xpu/167972 -> ciflow/xpu/167972 2025-12-04T08:56:59.7901210Z * [new tag] ciflow/xpu/167981 -> ciflow/xpu/167981 2025-12-04T08:56:59.7902146Z * [new tag] ciflow/xpu/168213 -> ciflow/xpu/168213 2025-12-04T08:56:59.7903322Z * [new tag] ciflow/xpu/168262 -> ciflow/xpu/168262 2025-12-04T08:56:59.7904287Z * [new tag] ciflow/xpu/168328 -> ciflow/xpu/168328 2025-12-04T08:56:59.7905698Z * [new tag] ciflow/xpu/168950 -> ciflow/xpu/168950 2025-12-04T08:56:59.7907231Z * [new tag] ciflow/xpu/169039 -> ciflow/xpu/169039 2025-12-04T08:56:59.7908550Z * [new tag] ciflow/xpu/169200 -> ciflow/xpu/169200 2025-12-04T08:56:59.7909747Z * [new tag] ciflow/xpu/169203 -> ciflow/xpu/169203 2025-12-04T08:56:59.7910710Z * [new tag] ciflow/xpu/169229 -> ciflow/xpu/169229 2025-12-04T08:56:59.7911915Z * [new tag] ciflow/xpu/169230 -> ciflow/xpu/169230 2025-12-04T08:56:59.7912901Z * [new tag] ciflow/xpu/169231 -> ciflow/xpu/169231 2025-12-04T08:56:59.7914259Z * [new tag] ciflow/xpu/169241 -> ciflow/xpu/169241 2025-12-04T08:56:59.7915238Z * [new tag] ciflow/xpu/169280 -> ciflow/xpu/169280 2025-12-04T08:56:59.7916439Z * [new tag] ciflow/xpu/169296 -> ciflow/xpu/169296 2025-12-04T08:56:59.7917663Z * [new tag] ciflow/xpu/169353 -> ciflow/xpu/169353 2025-12-04T08:56:59.7918766Z * [new tag] ciflow/xpu/169410 -> ciflow/xpu/169410 2025-12-04T08:56:59.7919762Z * [new tag] ciflow/xpu/169442 -> ciflow/xpu/169442 2025-12-04T08:56:59.7921033Z * [new tag] ciflow/xpu/169555 -> ciflow/xpu/169555 2025-12-04T08:56:59.7922252Z * [new tag] cslpull75 -> cslpull75 2025-12-04T08:56:59.7923439Z * [new tag] cslpull76 -> cslpull76 2025-12-04T08:56:59.7924410Z * [new tag] cslpull77 -> cslpull77 2025-12-04T08:56:59.7925566Z * [new tag] cslpull78 -> cslpull78 2025-12-04T08:56:59.7927085Z * [new tag] cslpull79 -> cslpull79 2025-12-04T08:56:59.7928533Z * [new tag] cslpull80 -> cslpull80 2025-12-04T08:56:59.7929716Z * [new tag] cslpull81 -> cslpull81 2025-12-04T08:56:59.7930856Z * [new tag] cslpull82 -> cslpull82 2025-12-04T08:56:59.7932081Z * [new tag] cslpull83 -> cslpull83 2025-12-04T08:56:59.7933397Z * [new tag] cslpull84 -> cslpull84 2025-12-04T08:56:59.7934378Z * [new tag] cslpull85 -> cslpull85 2025-12-04T08:56:59.7935672Z * [new tag] cslpull86 -> cslpull86 2025-12-04T08:56:59.7936847Z * [new tag] cslpull87 -> cslpull87 2025-12-04T08:56:59.7938102Z * [new tag] cslpull88 -> cslpull88 2025-12-04T08:56:59.7939242Z * [new tag] cslpull89 -> cslpull89 2025-12-04T08:56:59.7940159Z * [new tag] cslpull90 -> cslpull90 2025-12-04T08:56:59.7941704Z * [new tag] cslpull91 -> cslpull91 2025-12-04T08:56:59.7942870Z * [new tag] cslpull92 -> cslpull92 2025-12-04T08:56:59.7944124Z * [new tag] flight_5 -> flight_5 2025-12-04T08:56:59.7945477Z * [new tag] flight_5.1 -> flight_5.1 2025-12-04T08:56:59.7946720Z * [new tag] flight_5.2 -> flight_5.2 2025-12-04T08:56:59.7947991Z * [new tag] flight_5.3 -> flight_5.3 2025-12-04T08:56:59.7949181Z * [new tag] forpull1 -> forpull1 2025-12-04T08:56:59.7950573Z * [new tag] malfet/tag-2ef5611 -> malfet/tag-2ef5611 2025-12-04T08:56:59.7951848Z * [new tag] malfet/tag-317b1a0 -> malfet/tag-317b1a0 2025-12-04T08:56:59.7952966Z * [new tag] malfet/tag-ec6f767 -> malfet/tag-ec6f767 2025-12-04T08:56:59.7954249Z * [new tag] nightly-binary -> nightly-binary 2025-12-04T08:56:59.7955450Z * [new tag] sqzhang_flight4_plus -> sqzhang_flight4_plus 2025-12-04T08:56:59.7956725Z * [new tag] sqzhang_flight_3 -> sqzhang_flight_3 2025-12-04T08:56:59.7958324Z * [new tag] trunk/02d8bd6974cf84b721680d773dbdb1b6f40ce272 -> trunk/02d8bd6974cf84b721680d773dbdb1b6f40ce272 2025-12-04T08:56:59.7959541Z * [new tag] trunk/066997fb38ade71e00d78e9d572e380b5f02bd3e -> trunk/066997fb38ade71e00d78e9d572e380b5f02bd3e 2025-12-04T08:56:59.7961134Z * [new tag] trunk/076e7b19fa1d481ad778d06d2b49ba57d3ce8c88 -> trunk/076e7b19fa1d481ad778d06d2b49ba57d3ce8c88 2025-12-04T08:56:59.7962610Z * [new tag] trunk/07dcc0b83db3211653a38565a24e15acdba75654 -> trunk/07dcc0b83db3211653a38565a24e15acdba75654 2025-12-04T08:56:59.7963875Z * [new tag] trunk/082e96b68dfcd16cab7cfafc4d3d055767dab3eb -> trunk/082e96b68dfcd16cab7cfafc4d3d055767dab3eb 2025-12-04T08:56:59.7965073Z * [new tag] trunk/088048f2fea28ff7d450f65c72419ca45780d30b -> trunk/088048f2fea28ff7d450f65c72419ca45780d30b 2025-12-04T08:56:59.7966316Z * [new tag] trunk/09076941a95c76f4d9ad189d064dfd8baa39e672 -> trunk/09076941a95c76f4d9ad189d064dfd8baa39e672 2025-12-04T08:56:59.7967505Z * [new tag] trunk/0b80a4c62b94402844bf221791c096b0035c6d75 -> trunk/0b80a4c62b94402844bf221791c096b0035c6d75 2025-12-04T08:56:59.7969086Z * [new tag] trunk/0bbbdf1750567a980634ad907a325357ba8ba8f2 -> trunk/0bbbdf1750567a980634ad907a325357ba8ba8f2 2025-12-04T08:56:59.7970186Z * [new tag] trunk/0c281dd78773b2bc17c58ead0e4cd4ac46e775c5 -> trunk/0c281dd78773b2bc17c58ead0e4cd4ac46e775c5 2025-12-04T08:56:59.7971626Z * [new tag] trunk/135f3753c418a6879b1954904184937b67e61688 -> trunk/135f3753c418a6879b1954904184937b67e61688 2025-12-04T08:56:59.7972874Z * [new tag] trunk/15da21026cb13cd20257dc9e96830db108743c10 -> trunk/15da21026cb13cd20257dc9e96830db108743c10 2025-12-04T08:56:59.7974089Z * [new tag] trunk/166efdad2ac827f30fb02504c6017520257f88ec -> trunk/166efdad2ac827f30fb02504c6017520257f88ec 2025-12-04T08:56:59.7975284Z * [new tag] trunk/174272c15fae553d8488140af931f7d8050a313f -> trunk/174272c15fae553d8488140af931f7d8050a313f 2025-12-04T08:56:59.7976814Z * [new tag] trunk/18f3ca08f13b8de61307f5e8cd7d4cccb67e9d11 -> trunk/18f3ca08f13b8de61307f5e8cd7d4cccb67e9d11 2025-12-04T08:56:59.7977972Z * [new tag] trunk/1902eddfe655a15ebcf2c72bd81ade110fdeef63 -> trunk/1902eddfe655a15ebcf2c72bd81ade110fdeef63 2025-12-04T08:56:59.7979155Z * [new tag] trunk/195f92e98d3d66738577f11f22c4b5c8a1c76dd5 -> trunk/195f92e98d3d66738577f11f22c4b5c8a1c76dd5 2025-12-04T08:56:59.7980367Z * [new tag] trunk/1aa13e17de39e3c768ea7aebaad166ce72a06676 -> trunk/1aa13e17de39e3c768ea7aebaad166ce72a06676 2025-12-04T08:56:59.7981571Z * [new tag] trunk/1afe2832f58e24e54a5bfda5a5afa9b96fdea40e -> trunk/1afe2832f58e24e54a5bfda5a5afa9b96fdea40e 2025-12-04T08:56:59.7982857Z * [new tag] trunk/1c87554d74140eaee964ca8b1832cede67f5f520 -> trunk/1c87554d74140eaee964ca8b1832cede67f5f520 2025-12-04T08:56:59.7984076Z * [new tag] trunk/1ccb743b7b5be955f49736c162c4f5004b8a0dd8 -> trunk/1ccb743b7b5be955f49736c162c4f5004b8a0dd8 2025-12-04T08:56:59.7985333Z * [new tag] trunk/1cee47d6ce0a02227185b566593f002dd639ca0c -> trunk/1cee47d6ce0a02227185b566593f002dd639ca0c 2025-12-04T08:56:59.7986420Z * [new tag] trunk/1d21b4df2babe322e5d085ceb6de884eb260a62d -> trunk/1d21b4df2babe322e5d085ceb6de884eb260a62d 2025-12-04T08:56:59.7987694Z * [new tag] trunk/1e34fb2550e4aa650314f7a6d9f6daf4da7478a8 -> trunk/1e34fb2550e4aa650314f7a6d9f6daf4da7478a8 2025-12-04T08:56:59.7989002Z * [new tag] trunk/1e526fb5b1d93bfc70691c5c3955fdffc1b7b7de -> trunk/1e526fb5b1d93bfc70691c5c3955fdffc1b7b7de 2025-12-04T08:56:59.7990204Z * [new tag] trunk/1ee32a8b1f554a312d79bad01ded24f38cd95543 -> trunk/1ee32a8b1f554a312d79bad01ded24f38cd95543 2025-12-04T08:56:59.7991435Z * [new tag] trunk/201e2c4117eb9744594dad6a5c18213d7b4705d7 -> trunk/201e2c4117eb9744594dad6a5c18213d7b4705d7 2025-12-04T08:56:59.7992661Z * [new tag] trunk/2353a0f60eb4b4cb6675907a7fa9fbedc1c02e7f -> trunk/2353a0f60eb4b4cb6675907a7fa9fbedc1c02e7f 2025-12-04T08:56:59.7994140Z * [new tag] trunk/285779b1621cf9f073a062b0889a642d200308d9 -> trunk/285779b1621cf9f073a062b0889a642d200308d9 2025-12-04T08:56:59.7995156Z * [new tag] trunk/2887faaec6295d081580d09fce161201826c6d87 -> trunk/2887faaec6295d081580d09fce161201826c6d87 2025-12-04T08:56:59.7996404Z * [new tag] trunk/296e67c92635443c67b11c0ae1bd045f03ebb7bc -> trunk/296e67c92635443c67b11c0ae1bd045f03ebb7bc 2025-12-04T08:56:59.7997578Z * [new tag] trunk/29856679769b3dede478767e2fe6cfb51197cb25 -> trunk/29856679769b3dede478767e2fe6cfb51197cb25 2025-12-04T08:56:59.7998807Z * [new tag] trunk/29e5455a4740c326ab187c7aa7b5ef98034ea563 -> trunk/29e5455a4740c326ab187c7aa7b5ef98034ea563 2025-12-04T08:56:59.8000278Z * [new tag] trunk/2ac3ef882afb23136adc188975f0a8802fc68adf -> trunk/2ac3ef882afb23136adc188975f0a8802fc68adf 2025-12-04T08:56:59.8001257Z * [new tag] trunk/2bec68e73b64715354af076ad309335f943e36cd -> trunk/2bec68e73b64715354af076ad309335f943e36cd 2025-12-04T08:56:59.8002461Z * [new tag] trunk/2c87367e6f88662cd5cedbd1537748b7948c38e1 -> trunk/2c87367e6f88662cd5cedbd1537748b7948c38e1 2025-12-04T08:56:59.8003791Z * [new tag] trunk/2d1f78fe3ec13820f136a2e0336da12a25f41708 -> trunk/2d1f78fe3ec13820f136a2e0336da12a25f41708 2025-12-04T08:56:59.8004985Z * [new tag] trunk/2df6058f116a65722a0e03073402feb242572d35 -> trunk/2df6058f116a65722a0e03073402feb242572d35 2025-12-04T08:56:59.8006236Z * [new tag] trunk/2e0c2e170fe658c440775c8e5c44228aafcc47ec -> trunk/2e0c2e170fe658c440775c8e5c44228aafcc47ec 2025-12-04T08:56:59.8007707Z * [new tag] trunk/2f9b7dad7b5419b063bd0f2e204de192720ebb94 -> trunk/2f9b7dad7b5419b063bd0f2e204de192720ebb94 2025-12-04T08:56:59.8008861Z * [new tag] trunk/305168768a95d69c444df5cd334bb774edfe06f1 -> trunk/305168768a95d69c444df5cd334bb774edfe06f1 2025-12-04T08:56:59.8010071Z * [new tag] trunk/31fc12773026e8e00f054dd79ad9b2491e693b48 -> trunk/31fc12773026e8e00f054dd79ad9b2491e693b48 2025-12-04T08:56:59.8011273Z * [new tag] trunk/320de0c6b0a3e7c6d2693ea5c28d5d0156ba7991 -> trunk/320de0c6b0a3e7c6d2693ea5c28d5d0156ba7991 2025-12-04T08:56:59.8013214Z * [new tag] trunk/3418bd29475dff06695045fcdf93e7d0dac67da8 -> trunk/3418bd29475dff06695045fcdf93e7d0dac67da8 2025-12-04T08:56:59.8014449Z * [new tag] trunk/34a98608afa0cb5b48f0d6d30432fdd0a2614ddf -> trunk/34a98608afa0cb5b48f0d6d30432fdd0a2614ddf 2025-12-04T08:56:59.8015834Z * [new tag] trunk/35b7a9a26c5923d98aebaa41a031dae21788a9ee -> trunk/35b7a9a26c5923d98aebaa41a031dae21788a9ee 2025-12-04T08:56:59.8016975Z * [new tag] trunk/39d07dbf03a911bdd45d1af78d8638dc92074938 -> trunk/39d07dbf03a911bdd45d1af78d8638dc92074938 2025-12-04T08:56:59.8018113Z * [new tag] trunk/3cd98b4205ada151042cc7ff097a82d4a4b18725 -> trunk/3cd98b4205ada151042cc7ff097a82d4a4b18725 2025-12-04T08:56:59.8019295Z * [new tag] trunk/3d35fd20a78ff4d016fa80f4e5fad37191d7bcae -> trunk/3d35fd20a78ff4d016fa80f4e5fad37191d7bcae 2025-12-04T08:56:59.8020545Z * [new tag] trunk/409a5fee945c46a3edaf5df162812f201bfd7b2f -> trunk/409a5fee945c46a3edaf5df162812f201bfd7b2f 2025-12-04T08:56:59.8021759Z * [new tag] trunk/42e9005cda22da3f1c559c3649218cebd671027c -> trunk/42e9005cda22da3f1c559c3649218cebd671027c 2025-12-04T08:56:59.8023039Z * [new tag] trunk/43b94713bbf340d3c124fde02d0f73add4021247 -> trunk/43b94713bbf340d3c124fde02d0f73add4021247 2025-12-04T08:56:59.8024268Z * [new tag] trunk/44ac69388a4a5eb463dbd2a13f00d1e3b924566c -> trunk/44ac69388a4a5eb463dbd2a13f00d1e3b924566c 2025-12-04T08:56:59.8025491Z * [new tag] trunk/45d14e2497292be06ad36eaa1aaaf7c630a2586a -> trunk/45d14e2497292be06ad36eaa1aaaf7c630a2586a 2025-12-04T08:56:59.8026661Z * [new tag] trunk/45d310ad84854dff730c0b12e577d7998d978686 -> trunk/45d310ad84854dff730c0b12e577d7998d978686 2025-12-04T08:56:59.8028255Z * [new tag] trunk/47b28ddf7bd74b50fa93b307a7d3b183a6d77f54 -> trunk/47b28ddf7bd74b50fa93b307a7d3b183a6d77f54 2025-12-04T08:56:59.8029233Z * [new tag] trunk/481e5ab336275bd3acd5fa8a611b05b4469012af -> trunk/481e5ab336275bd3acd5fa8a611b05b4469012af 2025-12-04T08:56:59.8030528Z * [new tag] trunk/491731647f6b8a9345dcfb3bc9416aea254a7d96 -> trunk/491731647f6b8a9345dcfb3bc9416aea254a7d96 2025-12-04T08:56:59.8031760Z * [new tag] trunk/49a04d26088acc17d948ddd66920f3e16371e873 -> trunk/49a04d26088acc17d948ddd66920f3e16371e873 2025-12-04T08:56:59.8033028Z * [new tag] trunk/4bebc827c47d2f1f0fa1a417a5201a97aef3d985 -> trunk/4bebc827c47d2f1f0fa1a417a5201a97aef3d985 2025-12-04T08:56:59.8034104Z * [new tag] trunk/4c246677784c6a14bc2dbb9ff8773ef0a3a3222f -> trunk/4c246677784c6a14bc2dbb9ff8773ef0a3a3222f 2025-12-04T08:56:59.8035589Z * [new tag] trunk/4cfb47ff548b6d996641058cf04a70e311a4c3aa -> trunk/4cfb47ff548b6d996641058cf04a70e311a4c3aa 2025-12-04T08:56:59.8036680Z * [new tag] trunk/4e0061c1aa52f606dda8cfab0bd7591e588faf2c -> trunk/4e0061c1aa52f606dda8cfab0bd7591e588faf2c 2025-12-04T08:56:59.8038531Z * [new tag] trunk/4fefb8e7e942386ffac764a41b232241f82bea3a -> trunk/4fefb8e7e942386ffac764a41b232241f82bea3a 2025-12-04T08:56:59.8039667Z * [new tag] trunk/503b2640023521f5a35cd9a52fc8033d73a95d0d -> trunk/503b2640023521f5a35cd9a52fc8033d73a95d0d 2025-12-04T08:56:59.8040906Z * [new tag] trunk/518c2b1b3dab9a2ef2849e04b3bc2f20c1c41db9 -> trunk/518c2b1b3dab9a2ef2849e04b3bc2f20c1c41db9 2025-12-04T08:56:59.8042151Z * [new tag] trunk/5191b2fa68ba19960912bfd7fd721c79d76bb1f3 -> trunk/5191b2fa68ba19960912bfd7fd721c79d76bb1f3 2025-12-04T08:56:59.8043387Z * [new tag] trunk/52ac0f0dc4acacd219f1317fbc28ec631c01e07a -> trunk/52ac0f0dc4acacd219f1317fbc28ec631c01e07a 2025-12-04T08:56:59.8044720Z * [new tag] trunk/539ba711b029de9f191070f4f0d12f18f5b7f292 -> trunk/539ba711b029de9f191070f4f0d12f18f5b7f292 2025-12-04T08:56:59.8045956Z * [new tag] trunk/556375b55deebebbc56cb7aef81f4d52f031ba28 -> trunk/556375b55deebebbc56cb7aef81f4d52f031ba28 2025-12-04T08:56:59.8047247Z * [new tag] trunk/55c4ab554845481d0a69a3811937575fe8bb1a66 -> trunk/55c4ab554845481d0a69a3811937575fe8bb1a66 2025-12-04T08:56:59.8048531Z * [new tag] trunk/5634469fda9e5d98869c82c7d03bb08914245f96 -> trunk/5634469fda9e5d98869c82c7d03bb08914245f96 2025-12-04T08:56:59.8049586Z * [new tag] trunk/5778f6ff894686a975a9a23645178ae4c87ad5dc -> trunk/5778f6ff894686a975a9a23645178ae4c87ad5dc 2025-12-04T08:56:59.8050874Z * [new tag] trunk/587d63a3e07de5dc91065f9ef70bcacda9989068 -> trunk/587d63a3e07de5dc91065f9ef70bcacda9989068 2025-12-04T08:56:59.8052133Z * [new tag] trunk/597930f6b568852356ca9795dac76f9e4653adbd -> trunk/597930f6b568852356ca9795dac76f9e4653adbd 2025-12-04T08:56:59.8053433Z * [new tag] trunk/597df3a4e2a67b9fdbe1a89b2f4d74f822274db6 -> trunk/597df3a4e2a67b9fdbe1a89b2f4d74f822274db6 2025-12-04T08:56:59.8054830Z * [new tag] trunk/59abd50e931f4efb21b053f7a2911f5d8a49d883 -> trunk/59abd50e931f4efb21b053f7a2911f5d8a49d883 2025-12-04T08:56:59.8056008Z * [new tag] trunk/5a607febc04c3a2b5824c75f3f60307867439a2c -> trunk/5a607febc04c3a2b5824c75f3f60307867439a2c 2025-12-04T08:56:59.8057279Z * [new tag] trunk/5bf1cdf4755c54ef462b44cb8041b0a57311556b -> trunk/5bf1cdf4755c54ef462b44cb8041b0a57311556b 2025-12-04T08:56:59.8058359Z * [new tag] trunk/5f0030ba63d334d7e8c93a09e41403b89e4c573c -> trunk/5f0030ba63d334d7e8c93a09e41403b89e4c573c 2025-12-04T08:56:59.8059640Z * [new tag] trunk/5f21d27e71268464d362a96c9ac09ea475f7f202 -> trunk/5f21d27e71268464d362a96c9ac09ea475f7f202 2025-12-04T08:56:59.8061210Z * [new tag] trunk/5fafc13038c9988d9ac21fa793fbd5890604b447 -> trunk/5fafc13038c9988d9ac21fa793fbd5890604b447 2025-12-04T08:56:59.8064287Z * [new tag] trunk/61be54a31dc09b59d99b62176fb935aee0b924ef -> trunk/61be54a31dc09b59d99b62176fb935aee0b924ef 2025-12-04T08:56:59.8065507Z * [new tag] trunk/62d3ccd71484ed6a760d909b41487101bbc65719 -> trunk/62d3ccd71484ed6a760d909b41487101bbc65719 2025-12-04T08:56:59.8066703Z * [new tag] trunk/641cdb68ae27668eb441d0e49c87a0602c120c2b -> trunk/641cdb68ae27668eb441d0e49c87a0602c120c2b 2025-12-04T08:56:59.8067916Z * [new tag] trunk/65c4620d6bb0c6029f69762c22b91dda2294da9a -> trunk/65c4620d6bb0c6029f69762c22b91dda2294da9a 2025-12-04T08:56:59.8069172Z * [new tag] trunk/66004b993744b4106bf8afaba71f3c228a804206 -> trunk/66004b993744b4106bf8afaba71f3c228a804206 2025-12-04T08:56:59.8070434Z * [new tag] trunk/6658a04c7ca67acb64512341342e7b3ee13ee386 -> trunk/6658a04c7ca67acb64512341342e7b3ee13ee386 2025-12-04T08:56:59.8073119Z * [new tag] trunk/6864e309092a71f8ab0ca6a4dc7f8a4073fd31c4 -> trunk/6864e309092a71f8ab0ca6a4dc7f8a4073fd31c4 2025-12-04T08:56:59.8073958Z * [new tag] trunk/6c261c6cb07892c90ca19ed51c9705b1659a3f7d -> trunk/6c261c6cb07892c90ca19ed51c9705b1659a3f7d 2025-12-04T08:56:59.8074890Z * [new tag] trunk/6c8b6a043f1628188b6396b3a2a6e000ca68362b -> trunk/6c8b6a043f1628188b6396b3a2a6e000ca68362b 2025-12-04T08:56:59.8076085Z * [new tag] trunk/6ceb4a32f92ae67ce5d7d97931d17401ebf5ffa5 -> trunk/6ceb4a32f92ae67ce5d7d97931d17401ebf5ffa5 2025-12-04T08:56:59.8077254Z * [new tag] trunk/6e404e9b7d6f5fb0de86aa73888c3038248c17f8 -> trunk/6e404e9b7d6f5fb0de86aa73888c3038248c17f8 2025-12-04T08:56:59.8078571Z * [new tag] trunk/6ec30b490aee1db6bcdc7340abddef25784f08ec -> trunk/6ec30b490aee1db6bcdc7340abddef25784f08ec 2025-12-04T08:56:59.8079795Z * [new tag] trunk/6f2783a6c08e1db34275ff25176ffe9aebc30a71 -> trunk/6f2783a6c08e1db34275ff25176ffe9aebc30a71 2025-12-04T08:56:59.8081066Z * [new tag] trunk/6f53fefeb90ad3281119b5cfc4aa9ffd8a066e3d -> trunk/6f53fefeb90ad3281119b5cfc4aa9ffd8a066e3d 2025-12-04T08:56:59.8082294Z * [new tag] trunk/6f7dcf51e46d0c880db1a2f5c70de57adb576f4a -> trunk/6f7dcf51e46d0c880db1a2f5c70de57adb576f4a 2025-12-04T08:56:59.8083649Z * [new tag] trunk/6ff831180d2fa436c7f1c1af3adac641fce9d60e -> trunk/6ff831180d2fa436c7f1c1af3adac641fce9d60e 2025-12-04T08:56:59.8084881Z * [new tag] trunk/70076464a63ab218a7ceefb0e76ccd7131deb8f8 -> trunk/70076464a63ab218a7ceefb0e76ccd7131deb8f8 2025-12-04T08:56:59.8086405Z * [new tag] trunk/70d797a5fc109b20a517646fcaa819477cd0d485 -> trunk/70d797a5fc109b20a517646fcaa819477cd0d485 2025-12-04T08:56:59.8087483Z * [new tag] trunk/7348cb355ff0a6f79cd4871215aea72185748734 -> trunk/7348cb355ff0a6f79cd4871215aea72185748734 2025-12-04T08:56:59.8088771Z * [new tag] trunk/74fe26a1ebe32931783569f2e762e3c2c974901f -> trunk/74fe26a1ebe32931783569f2e762e3c2c974901f 2025-12-04T08:56:59.8090052Z * [new tag] trunk/76aeb8c7e0f795b3fddca134cbea9a69da3ee696 -> trunk/76aeb8c7e0f795b3fddca134cbea9a69da3ee696 2025-12-04T08:56:59.8091306Z * [new tag] trunk/7741edd4ed665f3988052e260863efb508d61a03 -> trunk/7741edd4ed665f3988052e260863efb508d61a03 2025-12-04T08:56:59.8092659Z * [new tag] trunk/78adb3b3df41b45d2368b67226d2f864b78939a6 -> trunk/78adb3b3df41b45d2368b67226d2f864b78939a6 2025-12-04T08:56:59.8094149Z * [new tag] trunk/79d7b178225e5ed24d4e1db74e5abbff848f5fb7 -> trunk/79d7b178225e5ed24d4e1db74e5abbff848f5fb7 2025-12-04T08:56:59.8095118Z * [new tag] trunk/7a1e316115fc6996b3f2336822ba5d5f6179f0c3 -> trunk/7a1e316115fc6996b3f2336822ba5d5f6179f0c3 2025-12-04T08:56:59.8096879Z * [new tag] trunk/7a41b66367c38d0af3e8a90f7be48d6b281e7bca -> trunk/7a41b66367c38d0af3e8a90f7be48d6b281e7bca 2025-12-04T08:56:59.8097995Z * [new tag] trunk/7b7af390ea8541c611d1ce2018a6934188fc197b -> trunk/7b7af390ea8541c611d1ce2018a6934188fc197b 2025-12-04T08:56:59.8099261Z * [new tag] trunk/7ba4680f3755a560af81aa0f688791e367aa3609 -> trunk/7ba4680f3755a560af81aa0f688791e367aa3609 2025-12-04T08:56:59.8100737Z * [new tag] trunk/7bc2a66ded06a0b2549aa51d807edc5dc3e73d1b -> trunk/7bc2a66ded06a0b2549aa51d807edc5dc3e73d1b 2025-12-04T08:56:59.8101719Z * [new tag] trunk/7c648509a7470ace9fb2bae960dd4790f7e943e9 -> trunk/7c648509a7470ace9fb2bae960dd4790f7e943e9 2025-12-04T08:56:59.8102861Z * [new tag] trunk/7cbc2d034cecd21ab5c9707d0a9c525c17143fb8 -> trunk/7cbc2d034cecd21ab5c9707d0a9c525c17143fb8 2025-12-04T08:56:59.8104145Z * [new tag] trunk/7d1bbaf4ba301ea3fba6f3c7bc02d58f6417aaed -> trunk/7d1bbaf4ba301ea3fba6f3c7bc02d58f6417aaed 2025-12-04T08:56:59.8105375Z * [new tag] trunk/7d2a33e4ebf60b217a3cd77feae19231eb996fc8 -> trunk/7d2a33e4ebf60b217a3cd77feae19231eb996fc8 2025-12-04T08:56:59.8106674Z * [new tag] trunk/7eb625920054b1126a7d2d99818aaa188c6ba95e -> trunk/7eb625920054b1126a7d2d99818aaa188c6ba95e 2025-12-04T08:56:59.8107719Z * [new tag] trunk/7f55ba19c456a3d6cc443dd9edb6bb7cca677ead -> trunk/7f55ba19c456a3d6cc443dd9edb6bb7cca677ead 2025-12-04T08:56:59.8108974Z * [new tag] trunk/81af382128efa094d8702e18f2c133760904c718 -> trunk/81af382128efa094d8702e18f2c133760904c718 2025-12-04T08:56:59.8110540Z * [new tag] trunk/84149583d483e9c973c9a0feda70e4f3964947b0 -> trunk/84149583d483e9c973c9a0feda70e4f3964947b0 2025-12-04T08:56:59.8111871Z * [new tag] trunk/85a315917efe82c24306be805c584ec044951c75 -> trunk/85a315917efe82c24306be805c584ec044951c75 2025-12-04T08:56:59.8113096Z * [new tag] trunk/87329491c82a5f8c1cc4ec11d8f55a5de2551ece -> trunk/87329491c82a5f8c1cc4ec11d8f55a5de2551ece 2025-12-04T08:56:59.8114203Z * [new tag] trunk/892640e25aeefa8007c5af837214b4502b6b62a6 -> trunk/892640e25aeefa8007c5af837214b4502b6b62a6 2025-12-04T08:56:59.8115721Z * [new tag] trunk/89e3bbcb5b5321dc8b9520b4d5a8ee60cea1d0b4 -> trunk/89e3bbcb5b5321dc8b9520b4d5a8ee60cea1d0b4 2025-12-04T08:56:59.8116847Z * [new tag] trunk/8c73bbbb02159223c0c97d268a0a74cb78158a1c -> trunk/8c73bbbb02159223c0c97d268a0a74cb78158a1c 2025-12-04T08:56:59.8118090Z * [new tag] trunk/8d56e98c8db988a22cb2dfaeefb30bc7d2a3cc43 -> trunk/8d56e98c8db988a22cb2dfaeefb30bc7d2a3cc43 2025-12-04T08:56:59.8119446Z * [new tag] trunk/8d9dd9603e5ee26c01007f0cd4f018e584840922 -> trunk/8d9dd9603e5ee26c01007f0cd4f018e584840922 2025-12-04T08:56:59.8120666Z * [new tag] trunk/8ef0c0b02b062d75e7c9be2594914a3e784d23ca -> trunk/8ef0c0b02b062d75e7c9be2594914a3e784d23ca 2025-12-04T08:56:59.8121906Z * [new tag] trunk/90b27e7e8352cde97d32ddad24740ef819633f38 -> trunk/90b27e7e8352cde97d32ddad24740ef819633f38 2025-12-04T08:56:59.8123036Z * [new tag] trunk/90f0139e64b2951815d524b6a373bed20c4fbf90 -> trunk/90f0139e64b2951815d524b6a373bed20c4fbf90 2025-12-04T08:56:59.8124244Z * [new tag] trunk/93d0d6838c56af59b0dba794e6aa08f0c1c7799c -> trunk/93d0d6838c56af59b0dba794e6aa08f0c1c7799c 2025-12-04T08:56:59.8125479Z * [new tag] trunk/94ca8d5f1e81fea3ae488650a0fb6795049a9f87 -> trunk/94ca8d5f1e81fea3ae488650a0fb6795049a9f87 2025-12-04T08:56:59.8126649Z * [new tag] trunk/9844fbeadd5cebdf1281d6fbf79164139c352693 -> trunk/9844fbeadd5cebdf1281d6fbf79164139c352693 2025-12-04T08:56:59.8127954Z * [new tag] trunk/99024dec888ec1e50b546822a32b6fb2f35e5eaa -> trunk/99024dec888ec1e50b546822a32b6fb2f35e5eaa 2025-12-04T08:56:59.8129221Z * [new tag] trunk/9a296e640fc88aa44d275b48cd9cc30c573b169d -> trunk/9a296e640fc88aa44d275b48cd9cc30c573b169d 2025-12-04T08:56:59.8130489Z * [new tag] trunk/9b3e34d8589b29f7b4e7fab6f78711b7ca6e4639 -> trunk/9b3e34d8589b29f7b4e7fab6f78711b7ca6e4639 2025-12-04T08:56:59.8131719Z * [new tag] trunk/9cd055e547e9b67a5f9827f8999c38d7eda1bcb8 -> trunk/9cd055e547e9b67a5f9827f8999c38d7eda1bcb8 2025-12-04T08:56:59.8133153Z * [new tag] trunk/9f0df5686cb4ada94f94620acba2e3c3f363b11d -> trunk/9f0df5686cb4ada94f94620acba2e3c3f363b11d 2025-12-04T08:56:59.8134397Z * [new tag] trunk/9f7fceb887d0cfa0326a59b887821c63ff11340a -> trunk/9f7fceb887d0cfa0326a59b887821c63ff11340a 2025-12-04T08:56:59.8135637Z * [new tag] trunk/9f8ef8855d3078d70f7b782540ff2aaf158d6742 -> trunk/9f8ef8855d3078d70f7b782540ff2aaf158d6742 2025-12-04T08:56:59.8137091Z * [new tag] trunk/9fb52efc797b47a1f425a03aa5e47b866d8b1098 -> trunk/9fb52efc797b47a1f425a03aa5e47b866d8b1098 2025-12-04T08:56:59.8138179Z * [new tag] trunk/9ff4a2ebc5762d46c73e46b1b523d7ff349fedfa -> trunk/9ff4a2ebc5762d46c73e46b1b523d7ff349fedfa 2025-12-04T08:56:59.8139422Z * [new tag] trunk/a0f3937b94422354538ebbd47202d5b0e8a3fd0d -> trunk/a0f3937b94422354538ebbd47202d5b0e8a3fd0d 2025-12-04T08:56:59.8141089Z * [new tag] trunk/a15066c28b3145e6edbfc88359d0411d14cfc70c -> trunk/a15066c28b3145e6edbfc88359d0411d14cfc70c 2025-12-04T08:56:59.8142159Z * [new tag] trunk/a20f775e82564d2a9979221ed7f3b8d7cf54ce90 -> trunk/a20f775e82564d2a9979221ed7f3b8d7cf54ce90 2025-12-04T08:56:59.8143368Z * [new tag] trunk/a2973fb00ec002dd4b6bbf07385f066efb259b8c -> trunk/a2973fb00ec002dd4b6bbf07385f066efb259b8c 2025-12-04T08:56:59.8144487Z * [new tag] trunk/a7dc6dab9ad911259d4801c502907e531594db45 -> trunk/a7dc6dab9ad911259d4801c502907e531594db45 2025-12-04T08:56:59.8145801Z * [new tag] trunk/a951a9cee65c01660bbc6e6fded90ecb10fa6109 -> trunk/a951a9cee65c01660bbc6e6fded90ecb10fa6109 2025-12-04T08:56:59.8147115Z * [new tag] trunk/abfa1a6d65c7c159e35c72c25979b9da4971689e -> trunk/abfa1a6d65c7c159e35c72c25979b9da4971689e 2025-12-04T08:56:59.8148441Z * [new tag] trunk/ae3a2395bf66151078e2d201716f7d63ce1c6f3e -> trunk/ae3a2395bf66151078e2d201716f7d63ce1c6f3e 2025-12-04T08:56:59.8149563Z * [new tag] trunk/afdff7f0325080dedac44d080cb5a3b0e65e6c5e -> trunk/afdff7f0325080dedac44d080cb5a3b0e65e6c5e 2025-12-04T08:56:59.8150678Z * [new tag] trunk/b1aed4e7a72c03a38f44543aaea0dae2e9b76d48 -> trunk/b1aed4e7a72c03a38f44543aaea0dae2e9b76d48 2025-12-04T08:56:59.8152121Z * [new tag] trunk/b1decff555cd50e2123c8c6e25cc0d447c411f62 -> trunk/b1decff555cd50e2123c8c6e25cc0d447c411f62 2025-12-04T08:56:59.8153342Z * [new tag] trunk/b2b6b034c9fd08672c40e63ef243556ad4c49bd2 -> trunk/b2b6b034c9fd08672c40e63ef243556ad4c49bd2 2025-12-04T08:56:59.8154744Z * [new tag] trunk/b39813b4a04931682b0491adba2138d01d716d99 -> trunk/b39813b4a04931682b0491adba2138d01d716d99 2025-12-04T08:56:59.8155940Z * [new tag] trunk/b3a7edb2311367974cc7cd764cfb11a5d6758b24 -> trunk/b3a7edb2311367974cc7cd764cfb11a5d6758b24 2025-12-04T08:56:59.8157254Z * [new tag] trunk/b4cc1329c86acaef6d42c1fac7169b8d870ab0d7 -> trunk/b4cc1329c86acaef6d42c1fac7169b8d870ab0d7 2025-12-04T08:56:59.8158559Z * [new tag] trunk/b555c39217f765759954a4f9f9bd1e9b87bed11a -> trunk/b555c39217f765759954a4f9f9bd1e9b87bed11a 2025-12-04T08:56:59.8159936Z * [new tag] trunk/b6b6c80379388b7f9932c3e6a0f9907bf430e417 -> trunk/b6b6c80379388b7f9932c3e6a0f9907bf430e417 2025-12-04T08:56:59.8161327Z * [new tag] trunk/b6b6d912df0b6f4082f8e50b18bd1de1dd7325f4 -> trunk/b6b6d912df0b6f4082f8e50b18bd1de1dd7325f4 2025-12-04T08:56:59.8162999Z * [new tag] trunk/b7d60685f8cbc939b68a20871e90db67e729329b -> trunk/b7d60685f8cbc939b68a20871e90db67e729329b 2025-12-04T08:56:59.8164155Z * [new tag] trunk/b7f6b9a4fc6259f7af068f31868b3119bb1bac3e -> trunk/b7f6b9a4fc6259f7af068f31868b3119bb1bac3e 2025-12-04T08:56:59.8165654Z * [new tag] trunk/b8c4ba3593761e7b2a3ebd86f040fb07b47c02cf -> trunk/b8c4ba3593761e7b2a3ebd86f040fb07b47c02cf 2025-12-04T08:56:59.8166800Z * [new tag] trunk/b9c8f3a4884befb965ff42620ce44a71b04887f5 -> trunk/b9c8f3a4884befb965ff42620ce44a71b04887f5 2025-12-04T08:56:59.8168265Z * [new tag] trunk/ba1412546f3082c0958c077acc2025e4dbc33f1f -> trunk/ba1412546f3082c0958c077acc2025e4dbc33f1f 2025-12-04T08:56:59.8169436Z * [new tag] trunk/bac403c0b38c63bdbcc0c31f1c2b0bc0260f610f -> trunk/bac403c0b38c63bdbcc0c31f1c2b0bc0260f610f 2025-12-04T08:56:59.8170681Z * [new tag] trunk/bb3034198b459401fabeab254e1b99f0115046e2 -> trunk/bb3034198b459401fabeab254e1b99f0115046e2 2025-12-04T08:56:59.8171955Z * [new tag] trunk/bc39b2b3bc7a6e19a42e62bd576974035086fe55 -> trunk/bc39b2b3bc7a6e19a42e62bd576974035086fe55 2025-12-04T08:56:59.8173573Z * [new tag] trunk/bc43d5b297f207a11d83d77ddf0152bdaabe15a8 -> trunk/bc43d5b297f207a11d83d77ddf0152bdaabe15a8 2025-12-04T08:56:59.8174906Z * [new tag] trunk/bc6a4863c7246a6493d16d4ea6eee71ec07c6a09 -> trunk/bc6a4863c7246a6493d16d4ea6eee71ec07c6a09 2025-12-04T08:56:59.8176047Z * [new tag] trunk/bea4912944defdbcb8b061800caab6cbbbd01df5 -> trunk/bea4912944defdbcb8b061800caab6cbbbd01df5 2025-12-04T08:56:59.8177650Z * [new tag] trunk/c04e2c656f48d82d1521b867bbbf03967b9b7564 -> trunk/c04e2c656f48d82d1521b867bbbf03967b9b7564 2025-12-04T08:56:59.8179390Z * [new tag] trunk/c0660bcee27e7d7731634e274576a7081882bede -> trunk/c0660bcee27e7d7731634e274576a7081882bede 2025-12-04T08:56:59.8180602Z * [new tag] trunk/c178ed43d3d99cbefe84fbfb21d6f282b20d62ac -> trunk/c178ed43d3d99cbefe84fbfb21d6f282b20d62ac 2025-12-04T08:56:59.8181899Z * [new tag] trunk/c55b1e8f61d041ee436d697449eb028931d574fb -> trunk/c55b1e8f61d041ee436d697449eb028931d574fb 2025-12-04T08:56:59.8183030Z * [new tag] trunk/c6ae7579fe12fe75f1a8f7043a494c90567273f1 -> trunk/c6ae7579fe12fe75f1a8f7043a494c90567273f1 2025-12-04T08:56:59.8184608Z * [new tag] trunk/c8210e7d94bad5ae21ac389fa4ba8a463c76c4d0 -> trunk/c8210e7d94bad5ae21ac389fa4ba8a463c76c4d0 2025-12-04T08:56:59.8185824Z * [new tag] trunk/cc0853af42122f8185321f542616f4474e717f09 -> trunk/cc0853af42122f8185321f542616f4474e717f09 2025-12-04T08:56:59.8187017Z * [new tag] trunk/cddec6562eabfa390d014fa3741a5659cf9c94c9 -> trunk/cddec6562eabfa390d014fa3741a5659cf9c94c9 2025-12-04T08:56:59.8188450Z * [new tag] trunk/ce5e7e3bf1f4b69a4f4f93d288ba75b906df492a -> trunk/ce5e7e3bf1f4b69a4f4f93d288ba75b906df492a 2025-12-04T08:56:59.8189638Z * [new tag] trunk/d038b0130ec7c20ebcac219301292fd8e98a1ace -> trunk/d038b0130ec7c20ebcac219301292fd8e98a1ace 2025-12-04T08:56:59.8190831Z * [new tag] trunk/d16447dacaf2420ea175f0c275c75da951f57d39 -> trunk/d16447dacaf2420ea175f0c275c75da951f57d39 2025-12-04T08:56:59.8192103Z * [new tag] trunk/d19f1e8cab6810bb2e99141f9976665954c67a50 -> trunk/d19f1e8cab6810bb2e99141f9976665954c67a50 2025-12-04T08:56:59.8193355Z * [new tag] trunk/d1c9f03b2a5af4104721712f8cdffe9b4f340c01 -> trunk/d1c9f03b2a5af4104721712f8cdffe9b4f340c01 2025-12-04T08:56:59.8194643Z * [new tag] trunk/d40f4950f2b7f7aa380a22fe0f6166e71680fbcf -> trunk/d40f4950f2b7f7aa380a22fe0f6166e71680fbcf 2025-12-04T08:56:59.8196078Z * [new tag] trunk/d5038950bacfe36bbf24a47a455fe76901deb8e8 -> trunk/d5038950bacfe36bbf24a47a455fe76901deb8e8 2025-12-04T08:56:59.8197155Z * [new tag] trunk/d54ff42903c2ae0533931ff11d23b35f875bdb3d -> trunk/d54ff42903c2ae0533931ff11d23b35f875bdb3d 2025-12-04T08:56:59.8198549Z * [new tag] trunk/d76697633a2d2b9cced1ae21161849b33bfe7e47 -> trunk/d76697633a2d2b9cced1ae21161849b33bfe7e47 2025-12-04T08:56:59.8199694Z * [new tag] trunk/d78f52b199c547106d4cd9d2856dd0805c118bf1 -> trunk/d78f52b199c547106d4cd9d2856dd0805c118bf1 2025-12-04T08:56:59.8200937Z * [new tag] trunk/d8fd5c6eed28e5004150691d048a3f6785e19a8e -> trunk/d8fd5c6eed28e5004150691d048a3f6785e19a8e 2025-12-04T08:56:59.8202308Z * [new tag] trunk/d900f5e86745dec76713f4b0ef07005ef36b2f5a -> trunk/d900f5e86745dec76713f4b0ef07005ef36b2f5a 2025-12-04T08:56:59.8203549Z * [new tag] trunk/d973dc6b87d763859fe1c5bd1287e3b6b1c49d1b -> trunk/d973dc6b87d763859fe1c5bd1287e3b6b1c49d1b 2025-12-04T08:56:59.8204851Z * [new tag] trunk/d998c03304cb6ede76e1ed535b4ddeb6c2bf40ec -> trunk/d998c03304cb6ede76e1ed535b4ddeb6c2bf40ec 2025-12-04T08:56:59.8206111Z * [new tag] trunk/d9cb8a70833101dbbe16b99520cfbdd70d0a87bf -> trunk/d9cb8a70833101dbbe16b99520cfbdd70d0a87bf 2025-12-04T08:56:59.8207534Z * [new tag] trunk/d9d5e91b43f70eb8637af55db6856d49be391ffd -> trunk/d9d5e91b43f70eb8637af55db6856d49be391ffd 2025-12-04T08:56:59.8208738Z * [new tag] trunk/dd18a75336a4fbd7497955cc5665904724fce889 -> trunk/dd18a75336a4fbd7497955cc5665904724fce889 2025-12-04T08:56:59.8210072Z * [new tag] trunk/ded9bcd61a059bf723e6e84689552962b480ea77 -> trunk/ded9bcd61a059bf723e6e84689552962b480ea77 2025-12-04T08:56:59.8211665Z * [new tag] trunk/dfbd3714d15c37a7b83b322a6b60f997fc00f50c -> trunk/dfbd3714d15c37a7b83b322a6b60f997fc00f50c 2025-12-04T08:56:59.8213053Z * [new tag] trunk/e115f9f4e4b039f8e9a642aaa2bd8254a920541b -> trunk/e115f9f4e4b039f8e9a642aaa2bd8254a920541b 2025-12-04T08:56:59.8214197Z * [new tag] trunk/e3f24fd73ad74c6e7176687986436956c7c18235 -> trunk/e3f24fd73ad74c6e7176687986436956c7c18235 2025-12-04T08:56:59.8215483Z * [new tag] trunk/e7d24d3ff93d1503ba63860b7057438ad93f918e -> trunk/e7d24d3ff93d1503ba63860b7057438ad93f918e 2025-12-04T08:56:59.8216862Z * [new tag] trunk/ea7035f462a0d2830865ee86c832bd101e1427fc -> trunk/ea7035f462a0d2830865ee86c832bd101e1427fc 2025-12-04T08:56:59.8218061Z * [new tag] trunk/eabb7ad2128580ef674446027b95bcf4e21e8df3 -> trunk/eabb7ad2128580ef674446027b95bcf4e21e8df3 2025-12-04T08:56:59.8219360Z * [new tag] trunk/eb5c63652a33da42e7018c23df5f20a3eb4c6ccf -> trunk/eb5c63652a33da42e7018c23df5f20a3eb4c6ccf 2025-12-04T08:56:59.8220644Z * [new tag] trunk/ec2c71f5c85021b8938cdafadce24c15a36fd93e -> trunk/ec2c71f5c85021b8938cdafadce24c15a36fd93e 2025-12-04T08:56:59.8221902Z * [new tag] trunk/ecbcc3f6bf327856b435b259ac63cc2f328c4b4e -> trunk/ecbcc3f6bf327856b435b259ac63cc2f328c4b4e 2025-12-04T08:56:59.8223592Z * [new tag] trunk/ee87bbe876c42575e961b32a0827d76bc9782ca2 -> trunk/ee87bbe876c42575e961b32a0827d76bc9782ca2 2025-12-04T08:56:59.8224728Z * [new tag] trunk/ef019d1d431c4c5a95b594cb90d40a50cd00f5e4 -> trunk/ef019d1d431c4c5a95b594cb90d40a50cd00f5e4 2025-12-04T08:56:59.8226024Z * [new tag] trunk/ef8ecc13830a86c4b231f1aad9aba7851db61b53 -> trunk/ef8ecc13830a86c4b231f1aad9aba7851db61b53 2025-12-04T08:56:59.8227278Z * [new tag] trunk/f1076f5510920044912247b1abb8760cb820f598 -> trunk/f1076f5510920044912247b1abb8760cb820f598 2025-12-04T08:56:59.8228548Z * [new tag] trunk/f2d6a75a00a1d648ca9a0abc6a33e14c3dea6c40 -> trunk/f2d6a75a00a1d648ca9a0abc6a33e14c3dea6c40 2025-12-04T08:56:59.8229797Z * [new tag] trunk/f47dd0ddef1359e5b43e4b962412f67b30ecde56 -> trunk/f47dd0ddef1359e5b43e4b962412f67b30ecde56 2025-12-04T08:56:59.8231305Z * [new tag] trunk/f49d32dfa4730dcfb1b60eeeb369b5889da983c8 -> trunk/f49d32dfa4730dcfb1b60eeeb369b5889da983c8 2025-12-04T08:56:59.8232474Z * [new tag] trunk/f4dedf78fc30fd4b93975787ca6074ee89db9467 -> trunk/f4dedf78fc30fd4b93975787ca6074ee89db9467 2025-12-04T08:56:59.8233764Z * [new tag] trunk/f7c0d03819ebed05c4038f095d66d1b8c54aca17 -> trunk/f7c0d03819ebed05c4038f095d66d1b8c54aca17 2025-12-04T08:56:59.8235035Z * [new tag] trunk/f7e1bd80a063e17453c361837ba6ea2570920a73 -> trunk/f7e1bd80a063e17453c361837ba6ea2570920a73 2025-12-04T08:56:59.8236194Z * [new tag] trunk/f9bd6c53624c7c0ea3772de78498326e84c2f0e7 -> trunk/f9bd6c53624c7c0ea3772de78498326e84c2f0e7 2025-12-04T08:56:59.8237694Z * [new tag] trunk/fb5be221a46b51bfc9509013b0d85bc5a9d4f15b -> trunk/fb5be221a46b51bfc9509013b0d85bc5a9d4f15b 2025-12-04T08:56:59.8238878Z * [new tag] trunk/fdf863d5e1de3b2688c9511e96876e34581dbfd7 -> trunk/fdf863d5e1de3b2688c9511e96876e34581dbfd7 2025-12-04T08:56:59.8240651Z * [new tag] trunk/fe0e65adfc0e7ca6e5f57e6ea8b16bd5cc967307 -> trunk/fe0e65adfc0e7ca6e5f57e6ea8b16bd5cc967307 2025-12-04T08:56:59.8241817Z * [new tag] trunk/fec710bf89173f5355468a7ce1afe9157c3d9009 -> trunk/fec710bf89173f5355468a7ce1afe9157c3d9009 2025-12-04T08:56:59.8243340Z * [new tag] trunk/ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 -> trunk/ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T08:56:59.8244339Z * [new tag] v0.1.1 -> v0.1.1 2025-12-04T08:56:59.8245562Z * [new tag] v0.1.10 -> v0.1.10 2025-12-04T08:56:59.8246648Z * [new tag] v0.1.11 -> v0.1.11 2025-12-04T08:56:59.8247974Z * [new tag] v0.1.12 -> v0.1.12 2025-12-04T08:56:59.8249277Z * [new tag] v0.1.2 -> v0.1.2 2025-12-04T08:56:59.8250428Z * [new tag] v0.1.3 -> v0.1.3 2025-12-04T08:56:59.8251553Z * [new tag] v0.1.4 -> v0.1.4 2025-12-04T08:56:59.8252887Z * [new tag] v0.1.5 -> v0.1.5 2025-12-04T08:56:59.8254094Z * [new tag] v0.1.6 -> v0.1.6 2025-12-04T08:56:59.8255222Z * [new tag] v0.1.7 -> v0.1.7 2025-12-04T08:56:59.8256347Z * [new tag] v0.1.8 -> v0.1.8 2025-12-04T08:56:59.8257545Z * [new tag] v0.1.9 -> v0.1.9 2025-12-04T08:56:59.8258752Z * [new tag] v0.2.0 -> v0.2.0 2025-12-04T08:56:59.8259976Z * [new tag] v0.3.0 -> v0.3.0 2025-12-04T08:56:59.8261238Z * [new tag] v0.3.1 -> v0.3.1 2025-12-04T08:56:59.8262686Z * [new tag] v0.4.0 -> v0.4.0 2025-12-04T08:56:59.8263780Z * [new tag] v0.4.1 -> v0.4.1 2025-12-04T08:56:59.8264970Z * [new tag] v1.0.0 -> v1.0.0 2025-12-04T08:56:59.8266174Z * [new tag] v1.0.0a0 -> v1.0.0a0 2025-12-04T08:56:59.8267432Z * [new tag] v1.0.1 -> v1.0.1 2025-12-04T08:56:59.8268627Z * [new tag] v1.0rc0 -> v1.0rc0 2025-12-04T08:56:59.8269574Z * [new tag] v1.0rc1 -> v1.0rc1 2025-12-04T08:56:59.8270872Z * [new tag] v1.1.0 -> v1.1.0 2025-12-04T08:56:59.8272540Z * [new tag] v1.1.0a0 -> v1.1.0a0 2025-12-04T08:56:59.8273887Z * [new tag] v1.10.0 -> v1.10.0 2025-12-04T08:56:59.8275280Z * [new tag] v1.10.0-rc1 -> v1.10.0-rc1 2025-12-04T08:56:59.8276494Z * [new tag] v1.10.0-rc2 -> v1.10.0-rc2 2025-12-04T08:56:59.8277446Z * [new tag] v1.10.0-rc3 -> v1.10.0-rc3 2025-12-04T08:56:59.8278742Z * [new tag] v1.10.1 -> v1.10.1 2025-12-04T08:56:59.8279790Z * [new tag] v1.10.1-rc1 -> v1.10.1-rc1 2025-12-04T08:56:59.8280742Z * [new tag] v1.10.2 -> v1.10.2 2025-12-04T08:56:59.8281894Z * [new tag] v1.10.2-rc1 -> v1.10.2-rc1 2025-12-04T08:56:59.8283096Z * [new tag] v1.11.0 -> v1.11.0 2025-12-04T08:56:59.8284456Z * [new tag] v1.11.0-rc1 -> v1.11.0-rc1 2025-12-04T08:56:59.8285767Z * [new tag] v1.11.0-rc2 -> v1.11.0-rc2 2025-12-04T08:56:59.8287068Z * [new tag] v1.11.0-rc3 -> v1.11.0-rc3 2025-12-04T08:56:59.8288417Z * [new tag] v1.11.0-rc4 -> v1.11.0-rc4 2025-12-04T08:56:59.8289667Z * [new tag] v1.11.0-rc5 -> v1.11.0-rc5 2025-12-04T08:56:59.8290589Z * [new tag] v1.11.0-rc6 -> v1.11.0-rc6 2025-12-04T08:56:59.8291739Z * [new tag] v1.11.0-rc7 -> v1.11.0-rc7 2025-12-04T08:56:59.8293121Z * [new tag] v1.12.0 -> v1.12.0 2025-12-04T08:56:59.8294322Z * [new tag] v1.12.0-rc1 -> v1.12.0-rc1 2025-12-04T08:56:59.8295686Z * [new tag] v1.12.0-rc2 -> v1.12.0-rc2 2025-12-04T08:56:59.8296826Z * [new tag] v1.12.0-rc3 -> v1.12.0-rc3 2025-12-04T08:56:59.8298011Z * [new tag] v1.12.0-rc4 -> v1.12.0-rc4 2025-12-04T08:56:59.8299246Z * [new tag] v1.12.0-rc5 -> v1.12.0-rc5 2025-12-04T08:56:59.8300462Z * [new tag] v1.12.0-rc6 -> v1.12.0-rc6 2025-12-04T08:56:59.8301670Z * [new tag] v1.12.0-rc7 -> v1.12.0-rc7 2025-12-04T08:56:59.8302685Z * [new tag] v1.12.0-rc8 -> v1.12.0-rc8 2025-12-04T08:56:59.8303693Z * [new tag] v1.12.1 -> v1.12.1 2025-12-04T08:56:59.8304989Z * [new tag] v1.12.1-rc1 -> v1.12.1-rc1 2025-12-04T08:56:59.8306261Z * [new tag] v1.12.1-rc2 -> v1.12.1-rc2 2025-12-04T08:56:59.8307499Z * [new tag] v1.12.1-rc3 -> v1.12.1-rc3 2025-12-04T08:56:59.8308734Z * [new tag] v1.12.1-rc4 -> v1.12.1-rc4 2025-12-04T08:56:59.8309775Z * [new tag] v1.12.1-rc5 -> v1.12.1-rc5 2025-12-04T08:56:59.8311035Z * [new tag] v1.13.0 -> v1.13.0 2025-12-04T08:56:59.8312172Z * [new tag] v1.13.0-rc1 -> v1.13.0-rc1 2025-12-04T08:56:59.8313325Z * [new tag] v1.13.0-rc2 -> v1.13.0-rc2 2025-12-04T08:56:59.8314549Z * [new tag] v1.13.0-rc3 -> v1.13.0-rc3 2025-12-04T08:56:59.8315972Z * [new tag] v1.13.0-rc4 -> v1.13.0-rc4 2025-12-04T08:56:59.8316962Z * [new tag] v1.13.0-rc5 -> v1.13.0-rc5 2025-12-04T08:56:59.8317977Z * [new tag] v1.13.0-rc6 -> v1.13.0-rc6 2025-12-04T08:56:59.8319212Z * [new tag] v1.13.1 -> v1.13.1 2025-12-04T08:56:59.8320275Z * [new tag] v1.13.1-rc1 -> v1.13.1-rc1 2025-12-04T08:56:59.8321445Z * [new tag] v1.2.0 -> v1.2.0 2025-12-04T08:56:59.8322642Z * [new tag] v1.2.0a0 -> v1.2.0a0 2025-12-04T08:56:59.8323860Z * [new tag] v1.3.0 -> v1.3.0 2025-12-04T08:56:59.8325075Z * [new tag] v1.3.0a0 -> v1.3.0a0 2025-12-04T08:56:59.8326166Z * [new tag] v1.3.1 -> v1.3.1 2025-12-04T08:56:59.8327355Z * [new tag] v1.4.0 -> v1.4.0 2025-12-04T08:56:59.8328525Z * [new tag] v1.4.0a0 -> v1.4.0a0 2025-12-04T08:56:59.8329588Z * [new tag] v1.4.1 -> v1.4.1 2025-12-04T08:56:59.8330802Z * [new tag] v1.5.0 -> v1.5.0 2025-12-04T08:56:59.8332071Z * [new tag] v1.5.0-rc1 -> v1.5.0-rc1 2025-12-04T08:56:59.8333495Z * [new tag] v1.5.0-rc2 -> v1.5.0-rc2 2025-12-04T08:56:59.8334774Z * [new tag] v1.5.0-rc3 -> v1.5.0-rc3 2025-12-04T08:56:59.8335972Z * [new tag] v1.5.0-rc4 -> v1.5.0-rc4 2025-12-04T08:56:59.8337009Z * [new tag] v1.5.0-rc5 -> v1.5.0-rc5 2025-12-04T08:56:59.8338256Z * [new tag] v1.5.1 -> v1.5.1 2025-12-04T08:56:59.8339313Z * [new tag] v1.5.1-rc1 -> v1.5.1-rc1 2025-12-04T08:56:59.8340325Z * [new tag] v1.6.0 -> v1.6.0 2025-12-04T08:56:59.8341592Z * [new tag] v1.6.0-rc1 -> v1.6.0-rc1 2025-12-04T08:56:59.8342880Z * [new tag] v1.6.0-rc2 -> v1.6.0-rc2 2025-12-04T08:56:59.8344160Z * [new tag] v1.6.0-rc3 -> v1.6.0-rc3 2025-12-04T08:56:59.8345429Z * [new tag] v1.6.0-rc4 -> v1.6.0-rc4 2025-12-04T08:56:59.8346566Z * [new tag] v1.6.0-rc5 -> v1.6.0-rc5 2025-12-04T08:56:59.8347745Z * [new tag] v1.6.0-rc6 -> v1.6.0-rc6 2025-12-04T08:56:59.8349225Z * [new tag] v1.6.0-rc7 -> v1.6.0-rc7 2025-12-04T08:56:59.8350494Z * [new tag] v1.7.0 -> v1.7.0 2025-12-04T08:56:59.8351723Z * [new tag] v1.7.0-rc1 -> v1.7.0-rc1 2025-12-04T08:56:59.8353130Z * [new tag] v1.7.0-rc2 -> v1.7.0-rc2 2025-12-04T08:56:59.8354376Z * [new tag] v1.7.0-rc3 -> v1.7.0-rc3 2025-12-04T08:56:59.8355402Z * [new tag] v1.7.0-rc4 -> v1.7.0-rc4 2025-12-04T08:56:59.8356622Z * [new tag] v1.7.1 -> v1.7.1 2025-12-04T08:56:59.8357970Z * [new tag] v1.7.1-rc1 -> v1.7.1-rc1 2025-12-04T08:56:59.8359277Z * [new tag] v1.7.1-rc2 -> v1.7.1-rc2 2025-12-04T08:56:59.8360335Z * [new tag] v1.7.1-rc3 -> v1.7.1-rc3 2025-12-04T08:56:59.8361559Z * [new tag] v1.8.0 -> v1.8.0 2025-12-04T08:56:59.8362985Z * [new tag] v1.8.0-rc1 -> v1.8.0-rc1 2025-12-04T08:56:59.8364255Z * [new tag] v1.8.0-rc2 -> v1.8.0-rc2 2025-12-04T08:56:59.8365512Z * [new tag] v1.8.0-rc3 -> v1.8.0-rc3 2025-12-04T08:56:59.8366691Z * [new tag] v1.8.0-rc4 -> v1.8.0-rc4 2025-12-04T08:56:59.8367623Z * [new tag] v1.8.0-rc5 -> v1.8.0-rc5 2025-12-04T08:56:59.8368817Z * [new tag] v1.8.1 -> v1.8.1 2025-12-04T08:56:59.8370054Z * [new tag] v1.8.1-rc1 -> v1.8.1-rc1 2025-12-04T08:56:59.8371017Z * [new tag] v1.8.1-rc2 -> v1.8.1-rc2 2025-12-04T08:56:59.8372147Z * [new tag] v1.8.1-rc3 -> v1.8.1-rc3 2025-12-04T08:56:59.8373857Z * [new tag] v1.8.2 -> v1.8.2 2025-12-04T08:56:59.8374739Z * [new tag] v1.8.2-rc1 -> v1.8.2-rc1 2025-12-04T08:56:59.8376115Z * [new tag] v1.9.0 -> v1.9.0 2025-12-04T08:56:59.8377361Z * [new tag] v1.9.0-rc1 -> v1.9.0-rc1 2025-12-04T08:56:59.8378635Z * [new tag] v1.9.0-rc2 -> v1.9.0-rc2 2025-12-04T08:56:59.8379719Z * [new tag] v1.9.0-rc3 -> v1.9.0-rc3 2025-12-04T08:56:59.8380810Z * [new tag] v1.9.0-rc4 -> v1.9.0-rc4 2025-12-04T08:56:59.8382085Z * [new tag] v1.9.1 -> v1.9.1 2025-12-04T08:56:59.8383401Z * [new tag] v1.9.1-rc1 -> v1.9.1-rc1 2025-12-04T08:56:59.8384457Z * [new tag] v1.9.1-rc2 -> v1.9.1-rc2 2025-12-04T08:56:59.8385744Z * [new tag] v2.0.0 -> v2.0.0 2025-12-04T08:56:59.8386933Z * [new tag] v2.0.0-rc1 -> v2.0.0-rc1 2025-12-04T08:56:59.8388204Z * [new tag] v2.0.0-rc2 -> v2.0.0-rc2 2025-12-04T08:56:59.8389459Z * [new tag] v2.0.0-rc3 -> v2.0.0-rc3 2025-12-04T08:56:59.8390689Z * [new tag] v2.0.0-rc4 -> v2.0.0-rc4 2025-12-04T08:56:59.8391935Z * [new tag] v2.0.0-rc5 -> v2.0.0-rc5 2025-12-04T08:56:59.8393033Z * [new tag] v2.0.0-rc6 -> v2.0.0-rc6 2025-12-04T08:56:59.8394289Z * [new tag] v2.0.1 -> v2.0.1 2025-12-04T08:56:59.8395695Z * [new tag] v2.0.1-rc1 -> v2.0.1-rc1 2025-12-04T08:56:59.8396545Z * [new tag] v2.0.1-rc2 -> v2.0.1-rc2 2025-12-04T08:56:59.8397794Z * [new tag] v2.0.1-rc3 -> v2.0.1-rc3 2025-12-04T08:56:59.8398836Z * [new tag] v2.0.1-rc4 -> v2.0.1-rc4 2025-12-04T08:56:59.8400550Z * [new tag] v2.1.0 -> v2.1.0 2025-12-04T08:56:59.8401760Z * [new tag] v2.1.0-rc1 -> v2.1.0-rc1 2025-12-04T08:56:59.8402969Z * [new tag] v2.1.0-rc2 -> v2.1.0-rc2 2025-12-04T08:56:59.8404334Z * [new tag] v2.1.0-rc3 -> v2.1.0-rc3 2025-12-04T08:56:59.8405686Z * [new tag] v2.1.0-rc4 -> v2.1.0-rc4 2025-12-04T08:56:59.8406939Z * [new tag] v2.1.0-rc5 -> v2.1.0-rc5 2025-12-04T08:56:59.8408001Z * [new tag] v2.1.0-rc6 -> v2.1.0-rc6 2025-12-04T08:56:59.8409210Z * [new tag] v2.1.1 -> v2.1.1 2025-12-04T08:56:59.8410583Z * [new tag] v2.1.1-rc1 -> v2.1.1-rc1 2025-12-04T08:56:59.8411916Z * [new tag] v2.1.1-rc2 -> v2.1.1-rc2 2025-12-04T08:56:59.8413397Z * [new tag] v2.1.1-rc3 -> v2.1.1-rc3 2025-12-04T08:56:59.8414640Z * [new tag] v2.1.1-rc4 -> v2.1.1-rc4 2025-12-04T08:56:59.8415836Z * [new tag] v2.1.1-rc5 -> v2.1.1-rc5 2025-12-04T08:56:59.8416874Z * [new tag] v2.1.1-rc6 -> v2.1.1-rc6 2025-12-04T08:56:59.8418085Z * [new tag] v2.1.2 -> v2.1.2 2025-12-04T08:56:59.8419336Z * [new tag] v2.1.2-rc1 -> v2.1.2-rc1 2025-12-04T08:56:59.8420605Z * [new tag] v2.1.2-rc2 -> v2.1.2-rc2 2025-12-04T08:56:59.8421647Z * [new tag] v2.1.2-rc3 -> v2.1.2-rc3 2025-12-04T08:56:59.8422922Z * [new tag] v2.2.0 -> v2.2.0 2025-12-04T08:56:59.8424153Z * [new tag] v2.2.0-rc1 -> v2.2.0-rc1 2025-12-04T08:56:59.8425346Z * [new tag] v2.2.0-rc2 -> v2.2.0-rc2 2025-12-04T08:56:59.8426540Z * [new tag] v2.2.0-rc3 -> v2.2.0-rc3 2025-12-04T08:56:59.8428174Z * [new tag] v2.2.0-rc4 -> v2.2.0-rc4 2025-12-04T08:56:59.8429365Z * [new tag] v2.2.0-rc5 -> v2.2.0-rc5 2025-12-04T08:56:59.8430555Z * [new tag] v2.2.0-rc6 -> v2.2.0-rc6 2025-12-04T08:56:59.8431606Z * [new tag] v2.2.0-rc7 -> v2.2.0-rc7 2025-12-04T08:56:59.8432690Z * [new tag] v2.2.0-rc8 -> v2.2.0-rc8 2025-12-04T08:56:59.8433894Z * [new tag] v2.2.1 -> v2.2.1 2025-12-04T08:56:59.8435223Z * [new tag] v2.2.1-rc1 -> v2.2.1-rc1 2025-12-04T08:56:59.8436282Z * [new tag] v2.2.1-rc2 -> v2.2.1-rc2 2025-12-04T08:56:59.8437339Z * [new tag] v2.2.1-rc3 -> v2.2.1-rc3 2025-12-04T08:56:59.8438392Z * [new tag] v2.2.2 -> v2.2.2 2025-12-04T08:56:59.8439725Z * [new tag] v2.2.2-rc1 -> v2.2.2-rc1 2025-12-04T08:56:59.8440785Z * [new tag] v2.2.2-rc2 -> v2.2.2-rc2 2025-12-04T08:56:59.8441828Z * [new tag] v2.2.2-rc3 -> v2.2.2-rc3 2025-12-04T08:56:59.8443149Z * [new tag] v2.3.0 -> v2.3.0 2025-12-04T08:56:59.8444397Z * [new tag] v2.3.0-rc1 -> v2.3.0-rc1 2025-12-04T08:56:59.8445764Z * [new tag] v2.3.0-rc10 -> v2.3.0-rc10 2025-12-04T08:56:59.8446995Z * [new tag] v2.3.0-rc11 -> v2.3.0-rc11 2025-12-04T08:56:59.8448037Z * [new tag] v2.3.0-rc12 -> v2.3.0-rc12 2025-12-04T08:56:59.8449257Z * [new tag] v2.3.0-rc2 -> v2.3.0-rc2 2025-12-04T08:56:59.8450546Z * [new tag] v2.3.0-rc3 -> v2.3.0-rc3 2025-12-04T08:56:59.8451765Z * [new tag] v2.3.0-rc4 -> v2.3.0-rc4 2025-12-04T08:56:59.8453143Z * [new tag] v2.3.0-rc5 -> v2.3.0-rc5 2025-12-04T08:56:59.8454187Z * [new tag] v2.3.0-rc6 -> v2.3.0-rc6 2025-12-04T08:56:59.8455426Z * [new tag] v2.3.0-rc7 -> v2.3.0-rc7 2025-12-04T08:56:59.8456679Z * [new tag] v2.3.0-rc8 -> v2.3.0-rc8 2025-12-04T08:56:59.8457732Z * [new tag] v2.3.0-rc9 -> v2.3.0-rc9 2025-12-04T08:56:59.8458794Z * [new tag] v2.3.1 -> v2.3.1 2025-12-04T08:56:59.8460053Z * [new tag] v2.3.1-rc1 -> v2.3.1-rc1 2025-12-04T08:56:59.8461292Z * [new tag] v2.3.1-rc2 -> v2.3.1-rc2 2025-12-04T08:56:59.8464956Z * [new tag] v2.3.1-rc3 -> v2.3.1-rc3 2025-12-04T08:56:59.8466202Z * [new tag] v2.4.0 -> v2.4.0 2025-12-04T08:56:59.8467466Z * [new tag] v2.4.0-rc1 -> v2.4.0-rc1 2025-12-04T08:56:59.8468645Z * [new tag] v2.4.0-rc2 -> v2.4.0-rc2 2025-12-04T08:56:59.8469852Z * [new tag] v2.4.0-rc3 -> v2.4.0-rc3 2025-12-04T08:56:59.8471042Z * [new tag] v2.4.0-rc4 -> v2.4.0-rc4 2025-12-04T08:56:59.8472367Z * [new tag] v2.4.0-rc5 -> v2.4.0-rc5 2025-12-04T08:56:59.8473654Z * [new tag] v2.4.0-rc6 -> v2.4.0-rc6 2025-12-04T08:56:59.8474911Z * [new tag] v2.4.0-rc7 -> v2.4.0-rc7 2025-12-04T08:56:59.8476096Z * [new tag] v2.4.0-rc8 -> v2.4.0-rc8 2025-12-04T08:56:59.8477326Z * [new tag] v2.4.0-rc9 -> v2.4.0-rc9 2025-12-04T08:56:59.8478410Z * [new tag] v2.4.1 -> v2.4.1 2025-12-04T08:56:59.8479695Z * [new tag] v2.4.1-rc1 -> v2.4.1-rc1 2025-12-04T08:56:59.8480932Z * [new tag] v2.4.1-rc2 -> v2.4.1-rc2 2025-12-04T08:56:59.8482203Z * [new tag] v2.4.1-rc3 -> v2.4.1-rc3 2025-12-04T08:56:59.8483377Z * [new tag] v2.5.0 -> v2.5.0 2025-12-04T08:56:59.8484659Z * [new tag] v2.5.0-rc1 -> v2.5.0-rc1 2025-12-04T08:56:59.8485731Z * [new tag] v2.5.0-rc10 -> v2.5.0-rc10 2025-12-04T08:56:59.8486982Z * [new tag] v2.5.0-rc2 -> v2.5.0-rc2 2025-12-04T08:56:59.8488188Z * [new tag] v2.5.0-rc3 -> v2.5.0-rc3 2025-12-04T08:56:59.8489411Z * [new tag] v2.5.0-rc4 -> v2.5.0-rc4 2025-12-04T08:56:59.8490656Z * [new tag] v2.5.0-rc5 -> v2.5.0-rc5 2025-12-04T08:56:59.8491943Z * [new tag] v2.5.0-rc6 -> v2.5.0-rc6 2025-12-04T08:56:59.8493410Z * [new tag] v2.5.0-rc7 -> v2.5.0-rc7 2025-12-04T08:56:59.8494624Z * [new tag] v2.5.0-rc8 -> v2.5.0-rc8 2025-12-04T08:56:59.8495878Z * [new tag] v2.5.0-rc9 -> v2.5.0-rc9 2025-12-04T08:56:59.8496923Z * [new tag] v2.5.1 -> v2.5.1 2025-12-04T08:56:59.8498099Z * [new tag] v2.5.1-rc1 -> v2.5.1-rc1 2025-12-04T08:56:59.8498882Z * [new tag] v2.6.0 -> v2.6.0 2025-12-04T08:56:59.8500251Z * [new tag] v2.6.0-rc1 -> v2.6.0-rc1 2025-12-04T08:56:59.8501555Z * [new tag] v2.6.0-rc2 -> v2.6.0-rc2 2025-12-04T08:56:59.8502799Z * [new tag] v2.6.0-rc3 -> v2.6.0-rc3 2025-12-04T08:56:59.8503960Z * [new tag] v2.6.0-rc4 -> v2.6.0-rc4 2025-12-04T08:56:59.8505350Z * [new tag] v2.6.0-rc5 -> v2.6.0-rc5 2025-12-04T08:56:59.8506712Z * [new tag] v2.6.0-rc6 -> v2.6.0-rc6 2025-12-04T08:56:59.8508405Z * [new tag] v2.6.0-rc7 -> v2.6.0-rc7 2025-12-04T08:56:59.8509701Z * [new tag] v2.6.0-rc8 -> v2.6.0-rc8 2025-12-04T08:56:59.8510962Z * [new tag] v2.6.0-rc9 -> v2.6.0-rc9 2025-12-04T08:56:59.8512421Z * [new tag] v2.7.0 -> v2.7.0 2025-12-04T08:56:59.8513694Z * [new tag] v2.7.0-rc1 -> v2.7.0-rc1 2025-12-04T08:56:59.8514786Z * [new tag] v2.7.0-rc10 -> v2.7.0-rc10 2025-12-04T08:56:59.8516161Z * [new tag] v2.7.0-rc2 -> v2.7.0-rc2 2025-12-04T08:56:59.8517463Z * [new tag] v2.7.0-rc3 -> v2.7.0-rc3 2025-12-04T08:56:59.8518722Z * [new tag] v2.7.0-rc4 -> v2.7.0-rc4 2025-12-04T08:56:59.8519937Z * [new tag] v2.7.0-rc5 -> v2.7.0-rc5 2025-12-04T08:56:59.8521139Z * [new tag] v2.7.0-rc6 -> v2.7.0-rc6 2025-12-04T08:56:59.8522407Z * [new tag] v2.7.0-rc7 -> v2.7.0-rc7 2025-12-04T08:56:59.8523700Z * [new tag] v2.7.0-rc8 -> v2.7.0-rc8 2025-12-04T08:56:59.8525154Z * [new tag] v2.7.0-rc9 -> v2.7.0-rc9 2025-12-04T08:56:59.8526218Z * [new tag] v2.7.1 -> v2.7.1 2025-12-04T08:56:59.8527502Z * [new tag] v2.7.1-rc1 -> v2.7.1-rc1 2025-12-04T08:56:59.8528780Z * [new tag] v2.7.1-rc2 -> v2.7.1-rc2 2025-12-04T08:56:59.8530106Z * [new tag] v2.7.1-rc3 -> v2.7.1-rc3 2025-12-04T08:56:59.8531400Z * [new tag] v2.7.1-rc4 -> v2.7.1-rc4 2025-12-04T08:56:59.8532777Z * [new tag] v2.7.1-rc5 -> v2.7.1-rc5 2025-12-04T08:56:59.8533866Z * [new tag] v2.8.0 -> v2.8.0 2025-12-04T08:56:59.8535089Z * [new tag] v2.8.0-rc1 -> v2.8.0-rc1 2025-12-04T08:56:59.8536354Z * [new tag] v2.8.0-rc2 -> v2.8.0-rc2 2025-12-04T08:56:59.8537627Z * [new tag] v2.8.0-rc3 -> v2.8.0-rc3 2025-12-04T08:56:59.8539050Z * [new tag] v2.8.0-rc4 -> v2.8.0-rc4 2025-12-04T08:56:59.8540311Z * [new tag] v2.8.0-rc5 -> v2.8.0-rc5 2025-12-04T08:56:59.8541598Z * [new tag] v2.8.0-rc6 -> v2.8.0-rc6 2025-12-04T08:56:59.8542880Z * [new tag] v2.8.0-rc7 -> v2.8.0-rc7 2025-12-04T08:56:59.8544113Z * [new tag] v2.8.0-rc8 -> v2.8.0-rc8 2025-12-04T08:56:59.8545366Z * [new tag] v2.9.0 -> v2.9.0 2025-12-04T08:56:59.8546657Z * [new tag] v2.9.0-rc1 -> v2.9.0-rc1 2025-12-04T08:56:59.8547942Z * [new tag] v2.9.0-rc10 -> v2.9.0-rc10 2025-12-04T08:56:59.8549262Z * [new tag] v2.9.0-rc11 -> v2.9.0-rc11 2025-12-04T08:56:59.8550954Z * [new tag] v2.9.0-rc2 -> v2.9.0-rc2 2025-12-04T08:56:59.8552123Z * [new tag] v2.9.0-rc3 -> v2.9.0-rc3 2025-12-04T08:56:59.8553388Z * [new tag] v2.9.0-rc4 -> v2.9.0-rc4 2025-12-04T08:56:59.8554635Z * [new tag] v2.9.0-rc5 -> v2.9.0-rc5 2025-12-04T08:56:59.8556106Z * [new tag] v2.9.0-rc6 -> v2.9.0-rc6 2025-12-04T08:56:59.8557388Z * [new tag] v2.9.0-rc7 -> v2.9.0-rc7 2025-12-04T08:56:59.8558817Z * [new tag] v2.9.0-rc8 -> v2.9.0-rc8 2025-12-04T08:56:59.8559909Z * [new tag] v2.9.0-rc9 -> v2.9.0-rc9 2025-12-04T08:56:59.8560983Z * [new tag] v2.9.1 -> v2.9.1 2025-12-04T08:56:59.8562402Z * [new tag] v2.9.1-rc1 -> v2.9.1-rc1 2025-12-04T08:56:59.8563764Z * [new tag] v2.9.1-rc2 -> v2.9.1-rc2 2025-12-04T08:56:59.8565541Z * [new tag] viable/strict/1759343184 -> viable/strict/1759343184 2025-12-04T08:56:59.8566793Z * [new tag] viable/strict/1759346540 -> viable/strict/1759346540 2025-12-04T08:56:59.8567965Z * [new tag] viable/strict/1759348181 -> viable/strict/1759348181 2025-12-04T08:56:59.8569118Z * [new tag] viable/strict/1759350324 -> viable/strict/1759350324 2025-12-04T08:56:59.8570285Z * [new tag] viable/strict/1759351793 -> viable/strict/1759351793 2025-12-04T08:56:59.8571454Z * [new tag] viable/strict/1759353844 -> viable/strict/1759353844 2025-12-04T08:56:59.8572824Z * [new tag] viable/strict/1759355374 -> viable/strict/1759355374 2025-12-04T08:56:59.8574036Z * [new tag] viable/strict/1759357472 -> viable/strict/1759357472 2025-12-04T08:56:59.8575214Z * [new tag] viable/strict/1759361002 -> viable/strict/1759361002 2025-12-04T08:56:59.8576745Z * [new tag] viable/strict/1759362585 -> viable/strict/1759362585 2025-12-04T08:56:59.8578182Z * [new tag] viable/strict/1759365359 -> viable/strict/1759365359 2025-12-04T08:56:59.8579468Z * [new tag] viable/strict/1759370089 -> viable/strict/1759370089 2025-12-04T08:56:59.8580794Z * [new tag] viable/strict/1759377554 -> viable/strict/1759377554 2025-12-04T08:56:59.8582032Z * [new tag] viable/strict/1759379133 -> viable/strict/1759379133 2025-12-04T08:56:59.8583245Z * [new tag] viable/strict/1759389871 -> viable/strict/1759389871 2025-12-04T08:56:59.8584499Z * [new tag] viable/strict/1759393562 -> viable/strict/1759393562 2025-12-04T08:56:59.8585753Z * [new tag] viable/strict/1759395076 -> viable/strict/1759395076 2025-12-04T08:56:59.8587037Z * [new tag] viable/strict/1759398579 -> viable/strict/1759398579 2025-12-04T08:56:59.8588279Z * [new tag] viable/strict/1759404142 -> viable/strict/1759404142 2025-12-04T08:56:59.8589514Z * [new tag] viable/strict/1759405773 -> viable/strict/1759405773 2025-12-04T08:56:59.8590787Z * [new tag] viable/strict/1759408041 -> viable/strict/1759408041 2025-12-04T08:56:59.8591997Z * [new tag] viable/strict/1759411593 -> viable/strict/1759411593 2025-12-04T08:56:59.8593257Z * [new tag] viable/strict/1759427395 -> viable/strict/1759427395 2025-12-04T08:56:59.8594471Z * [new tag] viable/strict/1759434582 -> viable/strict/1759434582 2025-12-04T08:56:59.8595713Z * [new tag] viable/strict/1759436720 -> viable/strict/1759436720 2025-12-04T08:56:59.8597004Z * [new tag] viable/strict/1759440219 -> viable/strict/1759440219 2025-12-04T08:56:59.8598250Z * [new tag] viable/strict/1759441948 -> viable/strict/1759441948 2025-12-04T08:56:59.8599684Z * [new tag] viable/strict/1759443860 -> viable/strict/1759443860 2025-12-04T08:56:59.8600876Z * [new tag] viable/strict/1759445377 -> viable/strict/1759445377 2025-12-04T08:56:59.8602093Z * [new tag] viable/strict/1759447415 -> viable/strict/1759447415 2025-12-04T08:56:59.8603306Z * [new tag] viable/strict/1759451750 -> viable/strict/1759451750 2025-12-04T08:56:59.8604722Z * [new tag] viable/strict/1759453910 -> viable/strict/1759453910 2025-12-04T08:56:59.8605955Z * [new tag] viable/strict/1759456483 -> viable/strict/1759456483 2025-12-04T08:56:59.8607205Z * [new tag] viable/strict/1759459279 -> viable/strict/1759459279 2025-12-04T08:56:59.8608492Z * [new tag] viable/strict/1759460742 -> viable/strict/1759460742 2025-12-04T08:56:59.8609820Z * [new tag] viable/strict/1759462025 -> viable/strict/1759462025 2025-12-04T08:56:59.8611087Z * [new tag] viable/strict/1759469086 -> viable/strict/1759469086 2025-12-04T08:56:59.8612368Z * [new tag] viable/strict/1759470581 -> viable/strict/1759470581 2025-12-04T08:56:59.8613745Z * [new tag] viable/strict/1759472786 -> viable/strict/1759472786 2025-12-04T08:56:59.8615021Z * [new tag] viable/strict/1759476294 -> viable/strict/1759476294 2025-12-04T08:56:59.8616212Z * [new tag] viable/strict/1759479963 -> viable/strict/1759479963 2025-12-04T08:56:59.8617435Z * [new tag] viable/strict/1759492177 -> viable/strict/1759492177 2025-12-04T08:56:59.8618723Z * [new tag] viable/strict/1759519278 -> viable/strict/1759519278 2025-12-04T08:56:59.8620067Z * [new tag] viable/strict/1759524580 -> viable/strict/1759524580 2025-12-04T08:56:59.8621186Z * [new tag] viable/strict/1759528193 -> viable/strict/1759528193 2025-12-04T08:56:59.8622569Z * [new tag] viable/strict/1759533797 -> viable/strict/1759533797 2025-12-04T08:56:59.8623862Z * [new tag] viable/strict/1759542780 -> viable/strict/1759542780 2025-12-04T08:56:59.8625162Z * [new tag] viable/strict/1759549779 -> viable/strict/1759549779 2025-12-04T08:56:59.8626429Z * [new tag] viable/strict/1759555455 -> viable/strict/1759555455 2025-12-04T08:56:59.8627662Z * [new tag] viable/strict/1759559176 -> viable/strict/1759559176 2025-12-04T08:56:59.8628956Z * [new tag] viable/strict/1759560629 -> viable/strict/1759560629 2025-12-04T08:56:59.8630204Z * [new tag] viable/strict/1759569848 -> viable/strict/1759569848 2025-12-04T08:56:59.8631543Z * [new tag] viable/strict/1759571382 -> viable/strict/1759571382 2025-12-04T08:56:59.8632834Z * [new tag] viable/strict/1759573474 -> viable/strict/1759573474 2025-12-04T08:56:59.8634030Z * [new tag] viable/strict/1759618187 -> viable/strict/1759618187 2025-12-04T08:56:59.8635280Z * [new tag] viable/strict/1759626742 -> viable/strict/1759626742 2025-12-04T08:56:59.8636527Z * [new tag] viable/strict/1759632427 -> viable/strict/1759632427 2025-12-04T08:56:59.8637748Z * [new tag] viable/strict/1759634971 -> viable/strict/1759634971 2025-12-04T08:56:59.8639663Z * [new tag] viable/strict/1759661382 -> viable/strict/1759661382 2025-12-04T08:56:59.8640913Z * [new tag] viable/strict/1759663294 -> viable/strict/1759663294 2025-12-04T08:56:59.8642064Z * [new tag] viable/strict/1759708178 -> viable/strict/1759708178 2025-12-04T08:56:59.8643293Z * [new tag] viable/strict/1759715695 -> viable/strict/1759715695 2025-12-04T08:56:59.8644643Z * [new tag] viable/strict/1759728293 -> viable/strict/1759728293 2025-12-04T08:56:59.8645886Z * [new tag] viable/strict/1759735513 -> viable/strict/1759735513 2025-12-04T08:56:59.8647161Z * [new tag] viable/strict/1759739177 -> viable/strict/1759739177 2025-12-04T08:56:59.8648466Z * [new tag] viable/strict/1759758635 -> viable/strict/1759758635 2025-12-04T08:56:59.8649721Z * [new tag] viable/strict/1759765784 -> viable/strict/1759765784 2025-12-04T08:56:59.8650976Z * [new tag] viable/strict/1759767948 -> viable/strict/1759767948 2025-12-04T08:56:59.8652257Z * [new tag] viable/strict/1759771461 -> viable/strict/1759771461 2025-12-04T08:56:59.8653550Z * [new tag] viable/strict/1759776706 -> viable/strict/1759776706 2025-12-04T08:56:59.8654845Z * [new tag] viable/strict/1759782317 -> viable/strict/1759782317 2025-12-04T08:56:59.8656167Z * [new tag] viable/strict/1759783777 -> viable/strict/1759783777 2025-12-04T08:56:59.8657510Z * [new tag] viable/strict/1759785815 -> viable/strict/1759785815 2025-12-04T08:56:59.8658737Z * [new tag] viable/strict/1759789459 -> viable/strict/1759789459 2025-12-04T08:56:59.8659962Z * [new tag] viable/strict/1759790974 -> viable/strict/1759790974 2025-12-04T08:56:59.8661151Z * [new tag] viable/strict/1759794583 -> viable/strict/1759794583 2025-12-04T08:56:59.8662452Z * [new tag] viable/strict/1759797408 -> viable/strict/1759797408 2025-12-04T08:56:59.8663932Z * [new tag] viable/strict/1759799518 -> viable/strict/1759799518 2025-12-04T08:56:59.8665197Z * [new tag] viable/strict/1759804909 -> viable/strict/1759804909 2025-12-04T08:56:59.8666494Z * [new tag] viable/strict/1759807643 -> viable/strict/1759807643 2025-12-04T08:56:59.8667739Z * [new tag] viable/strict/1759809089 -> viable/strict/1759809089 2025-12-04T08:56:59.8668955Z * [new tag] viable/strict/1759811145 -> viable/strict/1759811145 2025-12-04T08:56:59.8670216Z * [new tag] viable/strict/1759812581 -> viable/strict/1759812581 2025-12-04T08:56:59.8671482Z * [new tag] viable/strict/1759814683 -> viable/strict/1759814683 2025-12-04T08:56:59.8672754Z * [new tag] viable/strict/1759821889 -> viable/strict/1759821889 2025-12-04T08:56:59.8674037Z * [new tag] viable/strict/1759823376 -> viable/strict/1759823376 2025-12-04T08:56:59.8675272Z * [new tag] viable/strict/1759827107 -> viable/strict/1759827107 2025-12-04T08:56:59.8676504Z * [new tag] viable/strict/1759830577 -> viable/strict/1759830577 2025-12-04T08:56:59.8677806Z * [new tag] viable/strict/1759832720 -> viable/strict/1759832720 2025-12-04T08:56:59.8679026Z * [new tag] viable/strict/1759842063 -> viable/strict/1759842063 2025-12-04T08:56:59.8680308Z * [new tag] viable/strict/1759847121 -> viable/strict/1759847121 2025-12-04T08:56:59.8681762Z * [new tag] viable/strict/1759850721 -> viable/strict/1759850721 2025-12-04T08:56:59.8683031Z * [new tag] viable/strict/1759857870 -> viable/strict/1759857870 2025-12-04T08:56:59.8684307Z * [new tag] viable/strict/1759863143 -> viable/strict/1759863143 2025-12-04T08:56:59.8685559Z * [new tag] viable/strict/1759875874 -> viable/strict/1759875874 2025-12-04T08:56:59.8686747Z * [new tag] viable/strict/1759877385 -> viable/strict/1759877385 2025-12-04T08:56:59.8687993Z * [new tag] viable/strict/1759883801 -> viable/strict/1759883801 2025-12-04T08:56:59.8689264Z * [new tag] viable/strict/1759885922 -> viable/strict/1759885922 2025-12-04T08:56:59.8690687Z * [new tag] viable/strict/1759888488 -> viable/strict/1759888488 2025-12-04T08:56:59.8691709Z * [new tag] viable/strict/1759895471 -> viable/strict/1759895471 2025-12-04T08:56:59.8693183Z * [new tag] viable/strict/1759904803 -> viable/strict/1759904803 2025-12-04T08:56:59.8694517Z * [new tag] viable/strict/1759908300 -> viable/strict/1759908300 2025-12-04T08:56:59.8695771Z * [new tag] viable/strict/1759915520 -> viable/strict/1759915520 2025-12-04T08:56:59.8709501Z * [new tag] viable/strict/1759916978 -> viable/strict/1759916978 2025-12-04T08:56:59.8709865Z * [new tag] viable/strict/1759930024 -> viable/strict/1759930024 2025-12-04T08:56:59.8710021Z * [new tag] viable/strict/1759948122 -> viable/strict/1759948122 2025-12-04T08:56:59.8710163Z * [new tag] viable/strict/1759952983 -> viable/strict/1759952983 2025-12-04T08:56:59.8710298Z * [new tag] viable/strict/1759955121 -> viable/strict/1759955121 2025-12-04T08:56:59.8710434Z * [new tag] viable/strict/1759962298 -> viable/strict/1759962298 2025-12-04T08:56:59.8710563Z * [new tag] viable/strict/1759965837 -> viable/strict/1759965837 2025-12-04T08:56:59.8710722Z * [new tag] viable/strict/1759970213 -> viable/strict/1759970213 2025-12-04T08:56:59.8710953Z * [new tag] viable/strict/1759974894 -> viable/strict/1759974894 2025-12-04T08:56:59.8711140Z * [new tag] viable/strict/1759977763 -> viable/strict/1759977763 2025-12-04T08:56:59.8711267Z * [new tag] viable/strict/1759979241 -> viable/strict/1759979241 2025-12-04T08:56:59.8711398Z * [new tag] viable/strict/1759985417 -> viable/strict/1759985417 2025-12-04T08:56:59.8712315Z * [new tag] viable/strict/1759987490 -> viable/strict/1759987490 2025-12-04T08:56:59.8713832Z * [new tag] viable/strict/1759996180 -> viable/strict/1759996180 2025-12-04T08:56:59.8714890Z * [new tag] viable/strict/1760065682 -> viable/strict/1760065682 2025-12-04T08:56:59.8716352Z * [new tag] viable/strict/1760066894 -> viable/strict/1760066894 2025-12-04T08:56:59.8717487Z * [new tag] viable/strict/1760070345 -> viable/strict/1760070345 2025-12-04T08:56:59.8718999Z * [new tag] viable/strict/1760089782 -> viable/strict/1760089782 2025-12-04T08:56:59.8720134Z * [new tag] viable/strict/1760091921 -> viable/strict/1760091921 2025-12-04T08:56:59.8722007Z * [new tag] viable/strict/1760127924 -> viable/strict/1760127924 2025-12-04T08:56:59.8722987Z * [new tag] viable/strict/1760129489 -> viable/strict/1760129489 2025-12-04T08:56:59.8724425Z * [new tag] viable/strict/1760132980 -> viable/strict/1760132980 2025-12-04T08:56:59.8725748Z * [new tag] viable/strict/1760135060 -> viable/strict/1760135060 2025-12-04T08:56:59.8727013Z * [new tag] viable/strict/1760215782 -> viable/strict/1760215782 2025-12-04T08:56:59.8728338Z * [new tag] viable/strict/1760273849 -> viable/strict/1760273849 2025-12-04T08:56:59.8729635Z * [new tag] viable/strict/1760275517 -> viable/strict/1760275517 2025-12-04T08:56:59.8730922Z * [new tag] viable/strict/1760276979 -> viable/strict/1760276979 2025-12-04T08:56:59.8732181Z * [new tag] viable/strict/1760279007 -> viable/strict/1760279007 2025-12-04T08:56:59.8733608Z * [new tag] viable/strict/1760286328 -> viable/strict/1760286328 2025-12-04T08:56:59.8734798Z * [new tag] viable/strict/1760493304 -> viable/strict/1760493304 2025-12-04T08:56:59.8736063Z * [new tag] viable/strict/1760496298 -> viable/strict/1760496298 2025-12-04T08:56:59.8737438Z * [new tag] viable/strict/1760518396 -> viable/strict/1760518396 2025-12-04T08:56:59.8738711Z * [new tag] viable/strict/1760534864 -> viable/strict/1760534864 2025-12-04T08:56:59.8739909Z * [new tag] viable/strict/1760549062 -> viable/strict/1760549062 2025-12-04T08:56:59.8741242Z * [new tag] viable/strict/1760552799 -> viable/strict/1760552799 2025-12-04T08:56:59.8742499Z * [new tag] viable/strict/1760554355 -> viable/strict/1760554355 2025-12-04T08:56:59.8743823Z * [new tag] viable/strict/1760556275 -> viable/strict/1760556275 2025-12-04T08:56:59.8745095Z * [new tag] viable/strict/1760564979 -> viable/strict/1760564979 2025-12-04T08:56:59.8746451Z * [new tag] viable/strict/1760567049 -> viable/strict/1760567049 2025-12-04T08:56:59.8748014Z * [new tag] viable/strict/1760568585 -> viable/strict/1760568585 2025-12-04T08:56:59.8749275Z * [new tag] viable/strict/1760570630 -> viable/strict/1760570630 2025-12-04T08:56:59.8750559Z * [new tag] viable/strict/1760572180 -> viable/strict/1760572180 2025-12-04T08:56:59.8751824Z * [new tag] viable/strict/1760575094 -> viable/strict/1760575094 2025-12-04T08:56:59.8753159Z * [new tag] viable/strict/1760579709 -> viable/strict/1760579709 2025-12-04T08:56:59.8754773Z * [new tag] viable/strict/1760582614 -> viable/strict/1760582614 2025-12-04T08:56:59.8756141Z * [new tag] viable/strict/1760586815 -> viable/strict/1760586815 2025-12-04T08:56:59.8757333Z * [new tag] viable/strict/1760588829 -> viable/strict/1760588829 2025-12-04T08:56:59.8758564Z * [new tag] viable/strict/1760590200 -> viable/strict/1760590200 2025-12-04T08:56:59.8759836Z * [new tag] viable/strict/1760592311 -> viable/strict/1760592311 2025-12-04T08:56:59.8761109Z * [new tag] viable/strict/1760619733 -> viable/strict/1760619733 2025-12-04T08:56:59.8762205Z * [new tag] viable/strict/1760628335 -> viable/strict/1760628335 2025-12-04T08:56:59.8763668Z * [new tag] viable/strict/1760635490 -> viable/strict/1760635490 2025-12-04T08:56:59.8765065Z * [new tag] viable/strict/1760640743 -> viable/strict/1760640743 2025-12-04T08:56:59.8766315Z * [new tag] viable/strict/1760642528 -> viable/strict/1760642528 2025-12-04T08:56:59.8767613Z * [new tag] viable/strict/1760646330 -> viable/strict/1760646330 2025-12-04T08:56:59.8768856Z * [new tag] viable/strict/1760666101 -> viable/strict/1760666101 2025-12-04T08:56:59.8770195Z * [new tag] viable/strict/1760668990 -> viable/strict/1760668990 2025-12-04T08:56:59.8771487Z * [new tag] viable/strict/1760670600 -> viable/strict/1760670600 2025-12-04T08:56:59.8772829Z * [new tag] viable/strict/1760671704 -> viable/strict/1760671704 2025-12-04T08:56:59.8774166Z * [new tag] viable/strict/1760673121 -> viable/strict/1760673121 2025-12-04T08:56:59.8775416Z * [new tag] viable/strict/1760675352 -> viable/strict/1760675352 2025-12-04T08:56:59.8776710Z * [new tag] viable/strict/1760696731 -> viable/strict/1760696731 2025-12-04T08:56:59.8779071Z * [new tag] viable/strict/1760723515 -> viable/strict/1760723515 2025-12-04T08:56:59.8780336Z * [new tag] viable/strict/1760727234 -> viable/strict/1760727234 2025-12-04T08:56:59.8781587Z * [new tag] viable/strict/1760730578 -> viable/strict/1760730578 2025-12-04T08:56:59.8782849Z * [new tag] viable/strict/1760732726 -> viable/strict/1760732726 2025-12-04T08:56:59.8784158Z * [new tag] viable/strict/1760734180 -> viable/strict/1760734180 2025-12-04T08:56:59.8785575Z * [new tag] viable/strict/1760736251 -> viable/strict/1760736251 2025-12-04T08:56:59.8786645Z * [new tag] viable/strict/1760737772 -> viable/strict/1760737772 2025-12-04T08:56:59.8788191Z * [new tag] viable/strict/1760758005 -> viable/strict/1760758005 2025-12-04T08:56:59.8789445Z * [new tag] viable/strict/1760761532 -> viable/strict/1760761532 2025-12-04T08:56:59.8790743Z * [new tag] viable/strict/1760802581 -> viable/strict/1760802581 2025-12-04T08:56:59.8791995Z * [new tag] viable/strict/1760827772 -> viable/strict/1760827772 2025-12-04T08:56:59.8793276Z * [new tag] viable/strict/1760834524 -> viable/strict/1760834524 2025-12-04T08:56:59.8794586Z * [new tag] viable/strict/1760845009 -> viable/strict/1760845009 2025-12-04T08:56:59.8795872Z * [new tag] viable/strict/1760876836 -> viable/strict/1760876836 2025-12-04T08:56:59.8797133Z * [new tag] viable/strict/1760880329 -> viable/strict/1760880329 2025-12-04T08:56:59.8798408Z * [new tag] viable/strict/1760888987 -> viable/strict/1760888987 2025-12-04T08:56:59.8799646Z * [new tag] viable/strict/1760912664 -> viable/strict/1760912664 2025-12-04T08:56:59.8800901Z * [new tag] viable/strict/1760925321 -> viable/strict/1760925321 2025-12-04T08:56:59.8802137Z * [new tag] viable/strict/1760931488 -> viable/strict/1760931488 2025-12-04T08:56:59.8803343Z * [new tag] viable/strict/1760932693 -> viable/strict/1760932693 2025-12-04T08:56:59.8804700Z * [new tag] viable/strict/1761004184 -> viable/strict/1761004184 2025-12-04T08:56:59.8806377Z * [new tag] viable/strict/1761014748 -> viable/strict/1761014748 2025-12-04T08:56:59.8807662Z * [new tag] viable/strict/1761017491 -> viable/strict/1761017491 2025-12-04T08:56:59.8808921Z * [new tag] viable/strict/1761018806 -> viable/strict/1761018806 2025-12-04T08:56:59.8810299Z * [new tag] viable/strict/1761020754 -> viable/strict/1761020754 2025-12-04T08:56:59.8811534Z * [new tag] viable/strict/1761024303 -> viable/strict/1761024303 2025-12-04T08:56:59.8812915Z * [new tag] viable/strict/1761029582 -> viable/strict/1761029582 2025-12-04T08:56:59.8814245Z * [new tag] viable/strict/1761031535 -> viable/strict/1761031535 2025-12-04T08:56:59.8815451Z * [new tag] viable/strict/1761035196 -> viable/strict/1761035196 2025-12-04T08:56:59.8816753Z * [new tag] viable/strict/1761045825 -> viable/strict/1761045825 2025-12-04T08:56:59.8818063Z * [new tag] viable/strict/1761054796 -> viable/strict/1761054796 2025-12-04T08:56:59.8819383Z * [new tag] viable/strict/1761060314 -> viable/strict/1761060314 2025-12-04T08:56:59.8820705Z * [new tag] viable/strict/1761071198 -> viable/strict/1761071198 2025-12-04T08:56:59.8821989Z * [new tag] viable/strict/1761074628 -> viable/strict/1761074628 2025-12-04T08:56:59.8823260Z * [new tag] viable/strict/1761078351 -> viable/strict/1761078351 2025-12-04T08:56:59.8824541Z * [new tag] viable/strict/1761079822 -> viable/strict/1761079822 2025-12-04T08:56:59.8825827Z * [new tag] viable/strict/1761081873 -> viable/strict/1761081873 2025-12-04T08:56:59.8827126Z * [new tag] viable/strict/1761083392 -> viable/strict/1761083392 2025-12-04T08:56:59.8828400Z * [new tag] viable/strict/1761085465 -> viable/strict/1761085465 2025-12-04T08:56:59.8829732Z * [new tag] viable/strict/1761089099 -> viable/strict/1761089099 2025-12-04T08:56:59.8830951Z * [new tag] viable/strict/1761095535 -> viable/strict/1761095535 2025-12-04T08:56:59.8832266Z * [new tag] viable/strict/1761098119 -> viable/strict/1761098119 2025-12-04T08:56:59.8833812Z * [new tag] viable/strict/1761101330 -> viable/strict/1761101330 2025-12-04T08:56:59.8835162Z * [new tag] viable/strict/1761114425 -> viable/strict/1761114425 2025-12-04T08:56:59.8836422Z * [new tag] viable/strict/1761116036 -> viable/strict/1761116036 2025-12-04T08:56:59.8837644Z * [new tag] viable/strict/1761119379 -> viable/strict/1761119379 2025-12-04T08:56:59.8838917Z * [new tag] viable/strict/1761121601 -> viable/strict/1761121601 2025-12-04T08:56:59.8840182Z * [new tag] viable/strict/1761123234 -> viable/strict/1761123234 2025-12-04T08:56:59.8841405Z * [new tag] viable/strict/1761126621 -> viable/strict/1761126621 2025-12-04T08:56:59.8842646Z * [new tag] viable/strict/1761132259 -> viable/strict/1761132259 2025-12-04T08:56:59.8843998Z * [new tag] viable/strict/1761146746 -> viable/strict/1761146746 2025-12-04T08:56:59.8845250Z * [new tag] viable/strict/1761164752 -> viable/strict/1761164752 2025-12-04T08:56:59.8846485Z * [new tag] viable/strict/1761166198 -> viable/strict/1761166198 2025-12-04T08:56:59.8847805Z * [new tag] viable/strict/1761175424 -> viable/strict/1761175424 2025-12-04T08:56:59.8849077Z * [new tag] viable/strict/1761176983 -> viable/strict/1761176983 2025-12-04T08:56:59.8850499Z * [new tag] viable/strict/1761179891 -> viable/strict/1761179891 2025-12-04T08:56:59.8851749Z * [new tag] viable/strict/1761181930 -> viable/strict/1761181930 2025-12-04T08:56:59.8853137Z * [new tag] viable/strict/1761184516 -> viable/strict/1761184516 2025-12-04T08:56:59.8854552Z * [new tag] viable/strict/1761190179 -> viable/strict/1761190179 2025-12-04T08:56:59.8855791Z * [new tag] viable/strict/1761193558 -> viable/strict/1761193558 2025-12-04T08:56:59.8857024Z * [new tag] viable/strict/1761207990 -> viable/strict/1761207990 2025-12-04T08:56:59.8858294Z * [new tag] viable/strict/1761229539 -> viable/strict/1761229539 2025-12-04T08:56:59.8859709Z * [new tag] viable/strict/1761244031 -> viable/strict/1761244031 2025-12-04T08:56:59.8861013Z * [new tag] viable/strict/1761248986 -> viable/strict/1761248986 2025-12-04T08:56:59.8862288Z * [new tag] viable/strict/1761259791 -> viable/strict/1761259791 2025-12-04T08:56:59.8865662Z * [new tag] viable/strict/1761266139 -> viable/strict/1761266139 2025-12-04T08:56:59.8866975Z * [new tag] viable/strict/1761268316 -> viable/strict/1761268316 2025-12-04T08:56:59.8868273Z * [new tag] viable/strict/1761273805 -> viable/strict/1761273805 2025-12-04T08:56:59.8869549Z * [new tag] viable/strict/1761275261 -> viable/strict/1761275261 2025-12-04T08:56:59.8870875Z * [new tag] viable/strict/1761277913 -> viable/strict/1761277913 2025-12-04T08:56:59.8872185Z * [new tag] viable/strict/1761290701 -> viable/strict/1761290701 2025-12-04T08:56:59.8873479Z * [new tag] viable/strict/1761294396 -> viable/strict/1761294396 2025-12-04T08:56:59.8874734Z * [new tag] viable/strict/1761303047 -> viable/strict/1761303047 2025-12-04T08:56:59.8876005Z * [new tag] viable/strict/1761335388 -> viable/strict/1761335388 2025-12-04T08:56:59.8877277Z * [new tag] viable/strict/1761337551 -> viable/strict/1761337551 2025-12-04T08:56:59.8878527Z * [new tag] viable/strict/1761339007 -> viable/strict/1761339007 2025-12-04T08:56:59.8879789Z * [new tag] viable/strict/1761341050 -> viable/strict/1761341050 2025-12-04T08:56:59.8881202Z * [new tag] viable/strict/1761346188 -> viable/strict/1761346188 2025-12-04T08:56:59.8882576Z * [new tag] viable/strict/1761349792 -> viable/strict/1761349792 2025-12-04T08:56:59.8883794Z * [new tag] viable/strict/1761352620 -> viable/strict/1761352620 2025-12-04T08:56:59.8885043Z * [new tag] viable/strict/1761354730 -> viable/strict/1761354730 2025-12-04T08:56:59.8886325Z * [new tag] viable/strict/1761357298 -> viable/strict/1761357298 2025-12-04T08:56:59.8887571Z * [new tag] viable/strict/1761360201 -> viable/strict/1761360201 2025-12-04T08:56:59.8888834Z * [new tag] viable/strict/1761361753 -> viable/strict/1761361753 2025-12-04T08:56:59.8890185Z * [new tag] viable/strict/1761364351 -> viable/strict/1761364351 2025-12-04T08:56:59.8891845Z * [new tag] viable/strict/1761366338 -> viable/strict/1761366338 2025-12-04T08:56:59.8893379Z * [new tag] viable/strict/1761367802 -> viable/strict/1761367802 2025-12-04T08:56:59.8894635Z * [new tag] viable/strict/1761369889 -> viable/strict/1761369889 2025-12-04T08:56:59.8895969Z * [new tag] viable/strict/1761371385 -> viable/strict/1761371385 2025-12-04T08:56:59.8897273Z * [new tag] viable/strict/1761373581 -> viable/strict/1761373581 2025-12-04T08:56:59.8898724Z * [new tag] viable/strict/1761375054 -> viable/strict/1761375054 2025-12-04T08:56:59.8899980Z * [new tag] viable/strict/1761421785 -> viable/strict/1761421785 2025-12-04T08:56:59.8901319Z * [new tag] viable/strict/1761434614 -> viable/strict/1761434614 2025-12-04T08:56:59.8902834Z * [new tag] viable/strict/1761439254 -> viable/strict/1761439254 2025-12-04T08:56:59.8904269Z * [new tag] viable/strict/1761454187 -> viable/strict/1761454187 2025-12-04T08:56:59.8905557Z * [new tag] viable/strict/1761459991 -> viable/strict/1761459991 2025-12-04T08:56:59.8906962Z * [new tag] viable/strict/1761470668 -> viable/strict/1761470668 2025-12-04T08:56:59.8908556Z * [new tag] viable/strict/1761472188 -> viable/strict/1761472188 2025-12-04T08:56:59.8909924Z * [new tag] viable/strict/1761503178 -> viable/strict/1761503178 2025-12-04T08:56:59.8911202Z * [new tag] viable/strict/1761517492 -> viable/strict/1761517492 2025-12-04T08:56:59.8912495Z * [new tag] viable/strict/1761518981 -> viable/strict/1761518981 2025-12-04T08:56:59.8913848Z * [new tag] viable/strict/1761533609 -> viable/strict/1761533609 2025-12-04T08:56:59.8915094Z * [new tag] viable/strict/1761546438 -> viable/strict/1761546438 2025-12-04T08:56:59.8916392Z * [new tag] viable/strict/1761548133 -> viable/strict/1761548133 2025-12-04T08:56:59.8917878Z * [new tag] viable/strict/1761555186 -> viable/strict/1761555186 2025-12-04T08:56:59.8919261Z * [new tag] viable/strict/1761557178 -> viable/strict/1761557178 2025-12-04T08:56:59.8920566Z * [new tag] viable/strict/1761560772 -> viable/strict/1761560772 2025-12-04T08:56:59.8921903Z * [new tag] viable/strict/1761562266 -> viable/strict/1761562266 2025-12-04T08:56:59.8923205Z * [new tag] viable/strict/1761564260 -> viable/strict/1761564260 2025-12-04T08:56:59.8924497Z * [new tag] viable/strict/1761568072 -> viable/strict/1761568072 2025-12-04T08:56:59.8925791Z * [new tag] viable/strict/1761571683 -> viable/strict/1761571683 2025-12-04T08:56:59.8926960Z * [new tag] viable/strict/1761580199 -> viable/strict/1761580199 2025-12-04T08:56:59.8928274Z * [new tag] viable/strict/1761587383 -> viable/strict/1761587383 2025-12-04T08:56:59.8929698Z * [new tag] viable/strict/1761591165 -> viable/strict/1761591165 2025-12-04T08:56:59.8931009Z * [new tag] viable/strict/1761594575 -> viable/strict/1761594575 2025-12-04T08:56:59.8932224Z * [new tag] viable/strict/1761596710 -> viable/strict/1761596710 2025-12-04T08:56:59.8933598Z * [new tag] viable/strict/1761598189 -> viable/strict/1761598189 2025-12-04T08:56:59.8934879Z * [new tag] viable/strict/1761600254 -> viable/strict/1761600254 2025-12-04T08:56:59.8936199Z * [new tag] viable/strict/1761603879 -> viable/strict/1761603879 2025-12-04T08:56:59.8937545Z * [new tag] viable/strict/1761605429 -> viable/strict/1761605429 2025-12-04T08:56:59.8938889Z * [new tag] viable/strict/1761607468 -> viable/strict/1761607468 2025-12-04T08:56:59.8940183Z * [new tag] viable/strict/1761608983 -> viable/strict/1761608983 2025-12-04T08:56:59.8941641Z * [new tag] viable/strict/1761611846 -> viable/strict/1761611846 2025-12-04T08:56:59.8943015Z * [new tag] viable/strict/1761613922 -> viable/strict/1761613922 2025-12-04T08:56:59.8944217Z * [new tag] viable/strict/1761616504 -> viable/strict/1761616504 2025-12-04T08:56:59.8945390Z * [new tag] viable/strict/1761619599 -> viable/strict/1761619599 2025-12-04T08:56:59.8946646Z * [new tag] viable/strict/1761686693 -> viable/strict/1761686693 2025-12-04T08:56:59.8947899Z * [new tag] viable/strict/1761688179 -> viable/strict/1761688179 2025-12-04T08:56:59.8949241Z * [new tag] viable/strict/1761691973 -> viable/strict/1761691973 2025-12-04T08:56:59.8950652Z * [new tag] viable/strict/1761693884 -> viable/strict/1761693884 2025-12-04T08:56:59.8951917Z * [new tag] viable/strict/1761695389 -> viable/strict/1761695389 2025-12-04T08:56:59.8953274Z * [new tag] viable/strict/1761698408 -> viable/strict/1761698408 2025-12-04T08:56:59.8954535Z * [new tag] viable/strict/1761702931 -> viable/strict/1761702931 2025-12-04T08:56:59.8955964Z * [new tag] viable/strict/1761706307 -> viable/strict/1761706307 2025-12-04T08:56:59.8957288Z * [new tag] viable/strict/1761709065 -> viable/strict/1761709065 2025-12-04T08:56:59.8958664Z * [new tag] viable/strict/1761710285 -> viable/strict/1761710285 2025-12-04T08:56:59.8960013Z * [new tag] viable/strict/1761711983 -> viable/strict/1761711983 2025-12-04T08:56:59.8961349Z * [new tag] viable/strict/1761713514 -> viable/strict/1761713514 2025-12-04T08:56:59.8962936Z * [new tag] viable/strict/1761715523 -> viable/strict/1761715523 2025-12-04T08:56:59.8964329Z * [new tag] viable/strict/1761727973 -> viable/strict/1761727973 2025-12-04T08:56:59.8965659Z * [new tag] viable/strict/1761751558 -> viable/strict/1761751558 2025-12-04T08:56:59.8966995Z * [new tag] viable/strict/1761755187 -> viable/strict/1761755187 2025-12-04T08:56:59.8968374Z * [new tag] viable/strict/1761756826 -> viable/strict/1761756826 2025-12-04T08:56:59.8969704Z * [new tag] viable/strict/1761769551 -> viable/strict/1761769551 2025-12-04T08:56:59.8971120Z * [new tag] viable/strict/1761771032 -> viable/strict/1761771032 2025-12-04T08:56:59.8972411Z * [new tag] viable/strict/1761773101 -> viable/strict/1761773101 2025-12-04T08:56:59.8973758Z * [new tag] viable/strict/1761781792 -> viable/strict/1761781792 2025-12-04T08:56:59.8975134Z * [new tag] viable/strict/1761784788 -> viable/strict/1761784788 2025-12-04T08:56:59.8976491Z * [new tag] viable/strict/1761786740 -> viable/strict/1761786740 2025-12-04T08:56:59.8978387Z * [new tag] viable/strict/1761789332 -> viable/strict/1761789332 2025-12-04T08:56:59.8979992Z * [new tag] viable/strict/1761792569 -> viable/strict/1761792569 2025-12-04T08:56:59.8981301Z * [new tag] viable/strict/1761795289 -> viable/strict/1761795289 2025-12-04T08:56:59.8982632Z * [new tag] viable/strict/1761798345 -> viable/strict/1761798345 2025-12-04T08:56:59.8983983Z * [new tag] viable/strict/1761799827 -> viable/strict/1761799827 2025-12-04T08:56:59.8985429Z * [new tag] viable/strict/1761805604 -> viable/strict/1761805604 2025-12-04T08:56:59.8986743Z * [new tag] viable/strict/1761807202 -> viable/strict/1761807202 2025-12-04T08:56:59.8988088Z * [new tag] viable/strict/1761809094 -> viable/strict/1761809094 2025-12-04T08:56:59.8989430Z * [new tag] viable/strict/1761810576 -> viable/strict/1761810576 2025-12-04T08:56:59.8990759Z * [new tag] viable/strict/1761812771 -> viable/strict/1761812771 2025-12-04T08:56:59.8992112Z * [new tag] viable/strict/1761814363 -> viable/strict/1761814363 2025-12-04T08:56:59.8993446Z * [new tag] viable/strict/1761857410 -> viable/strict/1761857410 2025-12-04T08:56:59.8994797Z * [new tag] viable/strict/1761860985 -> viable/strict/1761860985 2025-12-04T08:56:59.8996109Z * [new tag] viable/strict/1761863094 -> viable/strict/1761863094 2025-12-04T08:56:59.8997411Z * [new tag] viable/strict/1761864590 -> viable/strict/1761864590 2025-12-04T08:56:59.8998723Z * [new tag] viable/strict/1761866675 -> viable/strict/1761866675 2025-12-04T08:56:59.9000253Z * [new tag] viable/strict/1761868178 -> viable/strict/1761868178 2025-12-04T08:56:59.9001542Z * [new tag] viable/strict/1761871111 -> viable/strict/1761871111 2025-12-04T08:56:59.9002863Z * [new tag] viable/strict/1761873126 -> viable/strict/1761873126 2025-12-04T08:56:59.9004258Z * [new tag] viable/strict/1761875714 -> viable/strict/1761875714 2025-12-04T08:56:59.9005688Z * [new tag] viable/strict/1761878924 -> viable/strict/1761878924 2025-12-04T08:56:59.9006994Z * [new tag] viable/strict/1761881727 -> viable/strict/1761881727 2025-12-04T08:56:59.9008299Z * [new tag] viable/strict/1761882959 -> viable/strict/1761882959 2025-12-04T08:56:59.9009625Z * [new tag] viable/strict/1761886268 -> viable/strict/1761886268 2025-12-04T08:56:59.9010982Z * [new tag] viable/strict/1761893641 -> viable/strict/1761893641 2025-12-04T08:56:59.9012332Z * [new tag] viable/strict/1761931517 -> viable/strict/1761931517 2025-12-04T08:56:59.9013741Z * [new tag] viable/strict/1761933080 -> viable/strict/1761933080 2025-12-04T08:56:59.9015044Z * [new tag] viable/strict/1761935217 -> viable/strict/1761935217 2025-12-04T08:56:59.9016455Z * [new tag] viable/strict/1761938533 -> viable/strict/1761938533 2025-12-04T08:56:59.9017786Z * [new tag] viable/strict/1761940184 -> viable/strict/1761940184 2025-12-04T08:56:59.9019094Z * [new tag] viable/strict/1761942338 -> viable/strict/1761942338 2025-12-04T08:56:59.9020441Z * [new tag] viable/strict/1761946100 -> viable/strict/1761946100 2025-12-04T08:56:59.9021780Z * [new tag] viable/strict/1761947374 -> viable/strict/1761947374 2025-12-04T08:56:59.9023267Z * [new tag] viable/strict/1761950978 -> viable/strict/1761950978 2025-12-04T08:56:59.9024606Z * [new tag] viable/strict/1761957727 -> viable/strict/1761957727 2025-12-04T08:56:59.9025968Z * [new tag] viable/strict/1761959532 -> viable/strict/1761959532 2025-12-04T08:56:59.9027446Z * [new tag] viable/strict/1761965366 -> viable/strict/1761965366 2025-12-04T08:56:59.9028858Z * [new tag] viable/strict/1761968066 -> viable/strict/1761968066 2025-12-04T08:56:59.9030212Z * [new tag] viable/strict/1761969322 -> viable/strict/1761969322 2025-12-04T08:56:59.9031507Z * [new tag] viable/strict/1761974723 -> viable/strict/1761974723 2025-12-04T08:56:59.9032842Z * [new tag] viable/strict/1761981837 -> viable/strict/1761981837 2025-12-04T08:56:59.9034241Z * [new tag] viable/strict/1761985546 -> viable/strict/1761985546 2025-12-04T08:56:59.9035605Z * [new tag] viable/strict/1761987030 -> viable/strict/1761987030 2025-12-04T08:56:59.9036959Z * [new tag] viable/strict/1762003554 -> viable/strict/1762003554 2025-12-04T08:56:59.9038283Z * [new tag] viable/strict/1762021560 -> viable/strict/1762021560 2025-12-04T08:56:59.9039623Z * [new tag] viable/strict/1762032190 -> viable/strict/1762032190 2025-12-04T08:56:59.9040949Z * [new tag] viable/strict/1762040981 -> viable/strict/1762040981 2025-12-04T08:56:59.9042309Z * [new tag] viable/strict/1762048525 -> viable/strict/1762048525 2025-12-04T08:56:59.9043652Z * [new tag] viable/strict/1762104223 -> viable/strict/1762104223 2025-12-04T08:56:59.9045197Z * [new tag] viable/strict/1762105778 -> viable/strict/1762105778 2025-12-04T08:56:59.9046522Z * [new tag] viable/strict/1762115109 -> viable/strict/1762115109 2025-12-04T08:56:59.9047867Z * [new tag] viable/strict/1762125840 -> viable/strict/1762125840 2025-12-04T08:56:59.9049068Z * [new tag] viable/strict/1762127377 -> viable/strict/1762127377 2025-12-04T08:56:59.9050641Z * [new tag] viable/strict/1762134925 -> viable/strict/1762134925 2025-12-04T08:56:59.9051856Z * [new tag] viable/strict/1762138338 -> viable/strict/1762138338 2025-12-04T08:56:59.9053400Z * [new tag] viable/strict/1762148993 -> viable/strict/1762148993 2025-12-04T08:56:59.9054776Z * [new tag] viable/strict/1762152871 -> viable/strict/1762152871 2025-12-04T08:56:59.9056318Z * [new tag] viable/strict/1762156183 -> viable/strict/1762156183 2025-12-04T08:56:59.9057656Z * [new tag] viable/strict/1762163457 -> viable/strict/1762163457 2025-12-04T08:56:59.9059022Z * [new tag] viable/strict/1762165569 -> viable/strict/1762165569 2025-12-04T08:56:59.9060401Z * [new tag] viable/strict/1762169035 -> viable/strict/1762169035 2025-12-04T08:56:59.9061722Z * [new tag] viable/strict/1762174936 -> viable/strict/1762174936 2025-12-04T08:56:59.9063294Z * [new tag] viable/strict/1762194412 -> viable/strict/1762194412 2025-12-04T08:56:59.9064649Z * [new tag] viable/strict/1762195876 -> viable/strict/1762195876 2025-12-04T08:56:59.9066418Z * [new tag] viable/strict/1762197788 -> viable/strict/1762197788 2025-12-04T08:56:59.9067750Z * [new tag] viable/strict/1762199389 -> viable/strict/1762199389 2025-12-04T08:56:59.9069283Z * [new tag] viable/strict/1762206585 -> viable/strict/1762206585 2025-12-04T08:56:59.9070697Z * [new tag] viable/strict/1762210184 -> viable/strict/1762210184 2025-12-04T08:56:59.9071899Z * [new tag] viable/strict/1762218736 -> viable/strict/1762218736 2025-12-04T08:56:59.9073369Z * [new tag] viable/strict/1762224529 -> viable/strict/1762224529 2025-12-04T08:56:59.9074758Z * [new tag] viable/strict/1762227253 -> viable/strict/1762227253 2025-12-04T08:56:59.9076003Z * [new tag] viable/strict/1762228515 -> viable/strict/1762228515 2025-12-04T08:56:59.9077538Z * [new tag] viable/strict/1762230349 -> viable/strict/1762230349 2025-12-04T08:56:59.9078769Z * [new tag] viable/strict/1762231859 -> viable/strict/1762231859 2025-12-04T08:56:59.9080063Z * [new tag] viable/strict/1762233925 -> viable/strict/1762233925 2025-12-04T08:56:59.9081512Z * [new tag] viable/strict/1762237630 -> viable/strict/1762237630 2025-12-04T08:56:59.9082734Z * [new tag] viable/strict/1762253522 -> viable/strict/1762253522 2025-12-04T08:56:59.9084102Z * [new tag] viable/strict/1762278588 -> viable/strict/1762278588 2025-12-04T08:56:59.9085423Z * [new tag] viable/strict/1762284203 -> viable/strict/1762284203 2025-12-04T08:56:59.9086890Z * [new tag] viable/strict/1762289446 -> viable/strict/1762289446 2025-12-04T08:56:59.9088227Z * [new tag] viable/strict/1762291515 -> viable/strict/1762291515 2025-12-04T08:56:59.9089615Z * [new tag] viable/strict/1762295100 -> viable/strict/1762295100 2025-12-04T08:56:59.9090830Z * [new tag] viable/strict/1762296590 -> viable/strict/1762296590 2025-12-04T08:56:59.9092035Z * [new tag] viable/strict/1762300179 -> viable/strict/1762300179 2025-12-04T08:56:59.9093390Z * [new tag] viable/strict/1762303207 -> viable/strict/1762303207 2025-12-04T08:56:59.9094719Z * [new tag] viable/strict/1762386584 -> viable/strict/1762386584 2025-12-04T08:56:59.9096087Z * [new tag] viable/strict/1762391537 -> viable/strict/1762391537 2025-12-04T08:56:59.9097338Z * [new tag] viable/strict/1762394119 -> viable/strict/1762394119 2025-12-04T08:56:59.9098923Z * [new tag] viable/strict/1762397437 -> viable/strict/1762397437 2025-12-04T08:56:59.9100252Z * [new tag] viable/strict/1762400256 -> viable/strict/1762400256 2025-12-04T08:56:59.9101590Z * [new tag] viable/strict/1762401469 -> viable/strict/1762401469 2025-12-04T08:56:59.9102979Z * [new tag] viable/strict/1762408195 -> viable/strict/1762408195 2025-12-04T08:56:59.9104344Z * [new tag] viable/strict/1762410411 -> viable/strict/1762410411 2025-12-04T08:56:59.9105781Z * [new tag] viable/strict/1762417613 -> viable/strict/1762417613 2025-12-04T08:56:59.9107153Z * [new tag] viable/strict/1762419198 -> viable/strict/1762419198 2025-12-04T08:56:59.9108586Z * [new tag] viable/strict/1762422656 -> viable/strict/1762422656 2025-12-04T08:56:59.9110196Z * [new tag] viable/strict/1762424746 -> viable/strict/1762424746 2025-12-04T08:56:59.9111578Z * [new tag] viable/strict/1762446386 -> viable/strict/1762446386 2025-12-04T08:56:59.9112912Z * [new tag] viable/strict/1762449912 -> viable/strict/1762449912 2025-12-04T08:56:59.9114307Z * [new tag] viable/strict/1762457031 -> viable/strict/1762457031 2025-12-04T08:56:59.9116265Z * [new tag] viable/strict/1762462441 -> viable/strict/1762462441 2025-12-04T08:56:59.9117474Z * [new tag] viable/strict/1762467909 -> viable/strict/1762467909 2025-12-04T08:56:59.9118298Z * [new tag] viable/strict/1762471493 -> viable/strict/1762471493 2025-12-04T08:56:59.9119829Z * [new tag] viable/strict/1762475990 -> viable/strict/1762475990 2025-12-04T08:56:59.9121238Z * [new tag] viable/strict/1762477933 -> viable/strict/1762477933 2025-12-04T08:56:59.9122560Z * [new tag] viable/strict/1762491053 -> viable/strict/1762491053 2025-12-04T08:56:59.9123884Z * [new tag] viable/strict/1762493118 -> viable/strict/1762493118 2025-12-04T08:56:59.9125297Z * [new tag] viable/strict/1762498442 -> viable/strict/1762498442 2025-12-04T08:56:59.9126688Z * [new tag] viable/strict/1762501778 -> viable/strict/1762501778 2025-12-04T08:56:59.9127998Z * [new tag] viable/strict/1762504001 -> viable/strict/1762504001 2025-12-04T08:56:59.9129390Z * [new tag] viable/strict/1762505583 -> viable/strict/1762505583 2025-12-04T08:56:59.9130801Z * [new tag] viable/strict/1762507523 -> viable/strict/1762507523 2025-12-04T08:56:59.9132162Z * [new tag] viable/strict/1762511140 -> viable/strict/1762511140 2025-12-04T08:56:59.9133717Z * [new tag] viable/strict/1762512632 -> viable/strict/1762512632 2025-12-04T08:56:59.9135054Z * [new tag] viable/strict/1762520467 -> viable/strict/1762520467 2025-12-04T08:56:59.9136434Z * [new tag] viable/strict/1762522016 -> viable/strict/1762522016 2025-12-04T08:56:59.9137734Z * [new tag] viable/strict/1762530591 -> viable/strict/1762530591 2025-12-04T08:56:59.9139100Z * [new tag] viable/strict/1762543405 -> viable/strict/1762543405 2025-12-04T08:56:59.9140357Z * [new tag] viable/strict/1762544998 -> viable/strict/1762544998 2025-12-04T08:56:59.9141725Z * [new tag] viable/strict/1762552182 -> viable/strict/1762552182 2025-12-04T08:56:59.9143031Z * [new tag] viable/strict/1762554297 -> viable/strict/1762554297 2025-12-04T08:56:59.9144245Z * [new tag] viable/strict/1762559381 -> viable/strict/1762559381 2025-12-04T08:56:59.9145578Z * [new tag] viable/strict/1762562222 -> viable/strict/1762562222 2025-12-04T08:56:59.9146963Z * [new tag] viable/strict/1762564319 -> viable/strict/1762564319 2025-12-04T08:56:59.9148153Z * [new tag] viable/strict/1762566904 -> viable/strict/1762566904 2025-12-04T08:56:59.9149442Z * [new tag] viable/strict/1762569781 -> viable/strict/1762569781 2025-12-04T08:56:59.9150722Z * [new tag] viable/strict/1762575940 -> viable/strict/1762575940 2025-12-04T08:56:59.9152108Z * [new tag] viable/strict/1762580974 -> viable/strict/1762580974 2025-12-04T08:56:59.9153974Z * [new tag] viable/strict/1762583185 -> viable/strict/1762583185 2025-12-04T08:56:59.9155274Z * [new tag] viable/strict/1762586647 -> viable/strict/1762586647 2025-12-04T08:56:59.9156641Z * [new tag] viable/strict/1762588183 -> viable/strict/1762588183 2025-12-04T08:56:59.9158019Z * [new tag] viable/strict/1762593886 -> viable/strict/1762593886 2025-12-04T08:56:59.9159349Z * [new tag] viable/strict/1762650743 -> viable/strict/1762650743 2025-12-04T08:56:59.9160935Z * [new tag] viable/strict/1762653328 -> viable/strict/1762653328 2025-12-04T08:56:59.9162300Z * [new tag] viable/strict/1762659342 -> viable/strict/1762659342 2025-12-04T08:56:59.9163821Z * [new tag] viable/strict/1762662360 -> viable/strict/1762662360 2025-12-04T08:56:59.9165117Z * [new tag] viable/strict/1762667377 -> viable/strict/1762667377 2025-12-04T08:56:59.9166455Z * [new tag] viable/strict/1762671090 -> viable/strict/1762671090 2025-12-04T08:56:59.9167803Z * [new tag] viable/strict/1762680284 -> viable/strict/1762680284 2025-12-04T08:56:59.9169166Z * [new tag] viable/strict/1762683900 -> viable/strict/1762683900 2025-12-04T08:56:59.9170570Z * [new tag] viable/strict/1762705541 -> viable/strict/1762705541 2025-12-04T08:56:59.9171883Z * [new tag] viable/strict/1762709004 -> viable/strict/1762709004 2025-12-04T08:56:59.9173389Z * [new tag] viable/strict/1762746004 -> viable/strict/1762746004 2025-12-04T08:56:59.9174827Z * [new tag] viable/strict/1762748799 -> viable/strict/1762748799 2025-12-04T08:56:59.9176334Z * [new tag] viable/strict/1762759504 -> viable/strict/1762759504 2025-12-04T08:56:59.9177655Z * [new tag] viable/strict/1762760973 -> viable/strict/1762760973 2025-12-04T08:56:59.9178916Z * [new tag] viable/strict/1762775374 -> viable/strict/1762775374 2025-12-04T08:56:59.9180294Z * [new tag] viable/strict/1762777661 -> viable/strict/1762777661 2025-12-04T08:56:59.9181664Z * [new tag] viable/strict/1762779774 -> viable/strict/1762779774 2025-12-04T08:56:59.9183128Z * [new tag] viable/strict/1762781259 -> viable/strict/1762781259 2025-12-04T08:56:59.9184588Z * [new tag] viable/strict/1762793628 -> viable/strict/1762793628 2025-12-04T08:56:59.9185933Z * [new tag] viable/strict/1762800711 -> viable/strict/1762800711 2025-12-04T08:56:59.9187319Z * [new tag] viable/strict/1762809894 -> viable/strict/1762809894 2025-12-04T08:56:59.9188691Z * [new tag] viable/strict/1762811384 -> viable/strict/1762811384 2025-12-04T08:56:59.9190102Z * [new tag] viable/strict/1762813841 -> viable/strict/1762813841 2025-12-04T08:56:59.9191438Z * [new tag] viable/strict/1762815047 -> viable/strict/1762815047 2025-12-04T08:56:59.9192913Z * [new tag] viable/strict/1762817094 -> viable/strict/1762817094 2025-12-04T08:56:59.9194271Z * [new tag] viable/strict/1762818582 -> viable/strict/1762818582 2025-12-04T08:56:59.9195607Z * [new tag] viable/strict/1762821623 -> viable/strict/1762821623 2025-12-04T08:56:59.9196859Z * [new tag] viable/strict/1762823531 -> viable/strict/1762823531 2025-12-04T08:56:59.9198284Z * [new tag] viable/strict/1762849583 -> viable/strict/1762849583 2025-12-04T08:56:59.9199609Z * [new tag] viable/strict/1762851200 -> viable/strict/1762851200 2025-12-04T08:56:59.9200960Z * [new tag] viable/strict/1762854603 -> viable/strict/1762854603 2025-12-04T08:56:59.9202325Z * [new tag] viable/strict/1762858276 -> viable/strict/1762858276 2025-12-04T08:56:59.9203665Z * [new tag] viable/strict/1762860891 -> viable/strict/1762860891 2025-12-04T08:56:59.9205553Z * [new tag] viable/strict/1762866174 -> viable/strict/1762866174 2025-12-04T08:56:59.9206989Z * [new tag] viable/strict/1762867653 -> viable/strict/1762867653 2025-12-04T08:56:59.9208281Z * [new tag] viable/strict/1762872669 -> viable/strict/1762872669 2025-12-04T08:56:59.9209522Z * [new tag] viable/strict/1762878380 -> viable/strict/1762878380 2025-12-04T08:56:59.9210850Z * [new tag] viable/strict/1762889003 -> viable/strict/1762889003 2025-12-04T08:56:59.9212266Z * [new tag] viable/strict/1762890589 -> viable/strict/1762890589 2025-12-04T08:56:59.9213836Z * [new tag] viable/strict/1762892743 -> viable/strict/1762892743 2025-12-04T08:56:59.9215095Z * [new tag] viable/strict/1762894271 -> viable/strict/1762894271 2025-12-04T08:56:59.9216310Z * [new tag] viable/strict/1762896287 -> viable/strict/1762896287 2025-12-04T08:56:59.9217728Z * [new tag] viable/strict/1762915871 -> viable/strict/1762915871 2025-12-04T08:56:59.9219063Z * [new tag] viable/strict/1762918569 -> viable/strict/1762918569 2025-12-04T08:56:59.9220278Z * [new tag] viable/strict/1762919776 -> viable/strict/1762919776 2025-12-04T08:56:59.9221596Z * [new tag] viable/strict/1762923072 -> viable/strict/1762923072 2025-12-04T08:56:59.9223016Z * [new tag] viable/strict/1762928826 -> viable/strict/1762928826 2025-12-04T08:56:59.9224430Z * [new tag] viable/strict/1762930451 -> viable/strict/1762930451 2025-12-04T08:56:59.9225823Z * [new tag] viable/strict/1762933780 -> viable/strict/1762933780 2025-12-04T08:56:59.9227171Z * [new tag] viable/strict/1762937638 -> viable/strict/1762937638 2025-12-04T08:56:59.9228732Z * [new tag] viable/strict/1762939545 -> viable/strict/1762939545 2025-12-04T08:56:59.9230075Z * [new tag] viable/strict/1762962692 -> viable/strict/1762962692 2025-12-04T08:56:59.9231320Z * [new tag] viable/strict/1762979143 -> viable/strict/1762979143 2025-12-04T08:56:59.9232640Z * [new tag] viable/strict/1762984188 -> viable/strict/1762984188 2025-12-04T08:56:59.9233883Z * [new tag] viable/strict/1762986306 -> viable/strict/1762986306 2025-12-04T08:56:59.9235249Z * [new tag] viable/strict/1762989903 -> viable/strict/1762989903 2025-12-04T08:56:59.9236575Z * [new tag] viable/strict/1762991377 -> viable/strict/1762991377 2025-12-04T08:56:59.9237939Z * [new tag] viable/strict/1762998921 -> viable/strict/1762998921 2025-12-04T08:56:59.9239335Z * [new tag] viable/strict/1763002287 -> viable/strict/1763002287 2025-12-04T08:56:59.9240684Z * [new tag] viable/strict/1763016840 -> viable/strict/1763016840 2025-12-04T08:56:59.9242452Z * [new tag] viable/strict/1763020180 -> viable/strict/1763020180 2025-12-04T08:56:59.9243873Z * [new tag] viable/strict/1763027421 -> viable/strict/1763027421 2025-12-04T08:56:59.9245229Z * [new tag] viable/strict/1763031120 -> viable/strict/1763031120 2025-12-04T08:56:59.9246561Z * [new tag] viable/strict/1763036861 -> viable/strict/1763036861 2025-12-04T08:56:59.9247919Z * [new tag] viable/strict/1763038993 -> viable/strict/1763038993 2025-12-04T08:56:59.9249409Z * [new tag] viable/strict/1763054703 -> viable/strict/1763054703 2025-12-04T08:56:59.9250630Z * [new tag] viable/strict/1763067061 -> viable/strict/1763067061 2025-12-04T08:56:59.9252004Z * [new tag] viable/strict/1763070847 -> viable/strict/1763070847 2025-12-04T08:56:59.9253471Z * [new tag] viable/strict/1763072706 -> viable/strict/1763072706 2025-12-04T08:56:59.9254925Z * [new tag] viable/strict/1763076302 -> viable/strict/1763076302 2025-12-04T08:56:59.9256211Z * [new tag] viable/strict/1763080816 -> viable/strict/1763080816 2025-12-04T08:56:59.9257590Z * [new tag] viable/strict/1763082732 -> viable/strict/1763082732 2025-12-04T08:56:59.9258912Z * [new tag] viable/strict/1763085329 -> viable/strict/1763085329 2025-12-04T08:56:59.9260231Z * [new tag] viable/strict/1763088623 -> viable/strict/1763088623 2025-12-04T08:56:59.9261666Z * [new tag] viable/strict/1763091402 -> viable/strict/1763091402 2025-12-04T08:56:59.9265006Z * [new tag] viable/strict/1763092602 -> viable/strict/1763092602 2025-12-04T08:56:59.9266377Z * [new tag] viable/strict/1763094355 -> viable/strict/1763094355 2025-12-04T08:56:59.9267719Z * [new tag] viable/strict/1763099390 -> viable/strict/1763099390 2025-12-04T08:56:59.9269082Z * [new tag] viable/strict/1763101608 -> viable/strict/1763101608 2025-12-04T08:56:59.9270471Z * [new tag] viable/strict/1763105102 -> viable/strict/1763105102 2025-12-04T08:56:59.9271877Z * [new tag] viable/strict/1763112347 -> viable/strict/1763112347 2025-12-04T08:56:59.9273255Z * [new tag] viable/strict/1763119471 -> viable/strict/1763119471 2025-12-04T08:56:59.9274483Z * [new tag] viable/strict/1763126835 -> viable/strict/1763126835 2025-12-04T08:56:59.9275686Z * [new tag] viable/strict/1763149779 -> viable/strict/1763149779 2025-12-04T08:56:59.9277188Z * [new tag] viable/strict/1763164178 -> viable/strict/1763164178 2025-12-04T08:56:59.9278404Z * [new tag] viable/strict/1763167104 -> viable/strict/1763167104 2025-12-04T08:56:59.9279704Z * [new tag] viable/strict/1763169132 -> viable/strict/1763169132 2025-12-04T08:56:59.9281024Z * [new tag] viable/strict/1763171708 -> viable/strict/1763171708 2025-12-04T08:56:59.9282375Z * [new tag] viable/strict/1763174759 -> viable/strict/1763174759 2025-12-04T08:56:59.9283715Z * [new tag] viable/strict/1763180744 -> viable/strict/1763180744 2025-12-04T08:56:59.9285059Z * [new tag] viable/strict/1763182227 -> viable/strict/1763182227 2025-12-04T08:56:59.9286398Z * [new tag] viable/strict/1763184309 -> viable/strict/1763184309 2025-12-04T08:56:59.9288147Z * [new tag] viable/strict/1763187991 -> viable/strict/1763187991 2025-12-04T08:56:59.9289492Z * [new tag] viable/strict/1763191445 -> viable/strict/1763191445 2025-12-04T08:56:59.9290903Z * [new tag] viable/strict/1763195152 -> viable/strict/1763195152 2025-12-04T08:56:59.9292136Z * [new tag] viable/strict/1763205769 -> viable/strict/1763205769 2025-12-04T08:56:59.9293810Z * [new tag] viable/strict/1763246990 -> viable/strict/1763246990 2025-12-04T08:56:59.9295244Z * [new tag] viable/strict/1763261578 -> viable/strict/1763261578 2025-12-04T08:56:59.9296458Z * [new tag] viable/strict/1763286573 -> viable/strict/1763286573 2025-12-04T08:56:59.9297725Z * [new tag] viable/strict/1763292167 -> viable/strict/1763292167 2025-12-04T08:56:59.9299083Z * [new tag] viable/strict/1763333386 -> viable/strict/1763333386 2025-12-04T08:56:59.9300443Z * [new tag] viable/strict/1763340082 -> viable/strict/1763340082 2025-12-04T08:56:59.9302356Z * [new tag] viable/strict/1763364324 -> viable/strict/1763364324 2025-12-04T08:56:59.9303766Z * [new tag] viable/strict/1763371569 -> viable/strict/1763371569 2025-12-04T08:56:59.9305086Z * [new tag] viable/strict/1763373067 -> viable/strict/1763373067 2025-12-04T08:56:59.9306484Z * [new tag] viable/strict/1763375157 -> viable/strict/1763375157 2025-12-04T08:56:59.9307762Z * [new tag] viable/strict/1763382462 -> viable/strict/1763382462 2025-12-04T08:56:59.9309080Z * [new tag] viable/strict/1763394661 -> viable/strict/1763394661 2025-12-04T08:56:59.9310628Z * [new tag] viable/strict/1763396797 -> viable/strict/1763396797 2025-12-04T08:56:59.9312039Z * [new tag] viable/strict/1763398542 -> viable/strict/1763398542 2025-12-04T08:56:59.9313387Z * [new tag] viable/strict/1763401807 -> viable/strict/1763401807 2025-12-04T08:56:59.9314611Z * [new tag] viable/strict/1763414698 -> viable/strict/1763414698 2025-12-04T08:56:59.9315947Z * [new tag] viable/strict/1763419807 -> viable/strict/1763419807 2025-12-04T08:56:59.9317302Z * [new tag] viable/strict/1763426369 -> viable/strict/1763426369 2025-12-04T08:56:59.9318658Z * [new tag] viable/strict/1763428331 -> viable/strict/1763428331 2025-12-04T08:56:59.9320007Z * [new tag] viable/strict/1763430922 -> viable/strict/1763430922 2025-12-04T08:56:59.9321225Z * [new tag] viable/strict/1763434184 -> viable/strict/1763434184 2025-12-04T08:56:59.9322631Z * [new tag] viable/strict/1763439973 -> viable/strict/1763439973 2025-12-04T08:56:59.9323987Z * [new tag] viable/strict/1763444995 -> viable/strict/1763444995 2025-12-04T08:56:59.9325443Z * [new tag] viable/strict/1763447206 -> viable/strict/1763447206 2025-12-04T08:56:59.9326701Z * [new tag] viable/strict/1763448826 -> viable/strict/1763448826 2025-12-04T08:56:59.9328095Z * [new tag] viable/strict/1763450717 -> viable/strict/1763450717 2025-12-04T08:56:59.9329417Z * [new tag] viable/strict/1763452183 -> viable/strict/1763452183 2025-12-04T08:56:59.9330871Z * [new tag] viable/strict/1763457945 -> viable/strict/1763457945 2025-12-04T08:56:59.9332698Z * [new tag] viable/strict/1763459439 -> viable/strict/1763459439 2025-12-04T08:56:59.9333911Z * [new tag] viable/strict/1763461556 -> viable/strict/1763461556 2025-12-04T08:56:59.9335238Z * [new tag] viable/strict/1763463103 -> viable/strict/1763463103 2025-12-04T08:56:59.9336592Z * [new tag] viable/strict/1763465100 -> viable/strict/1763465100 2025-12-04T08:56:59.9337825Z * [new tag] viable/strict/1763468866 -> viable/strict/1763468866 2025-12-04T08:56:59.9339100Z * [new tag] viable/strict/1763493823 -> viable/strict/1763493823 2025-12-04T08:56:59.9340281Z * [new tag] viable/strict/1763496249 -> viable/strict/1763496249 2025-12-04T08:56:59.9341859Z * [new tag] viable/strict/1763502620 -> viable/strict/1763502620 2025-12-04T08:56:59.9343339Z * [new tag] viable/strict/1763504715 -> viable/strict/1763504715 2025-12-04T08:56:59.9344699Z * [new tag] viable/strict/1763506208 -> viable/strict/1763506208 2025-12-04T08:56:59.9346022Z * [new tag] viable/strict/1763520590 -> viable/strict/1763520590 2025-12-04T08:56:59.9347392Z * [new tag] viable/strict/1763523357 -> viable/strict/1763523357 2025-12-04T08:56:59.9348831Z * [new tag] viable/strict/1763529922 -> viable/strict/1763529922 2025-12-04T08:56:59.9350233Z * [new tag] viable/strict/1763531408 -> viable/strict/1763531408 2025-12-04T08:56:59.9351554Z * [new tag] viable/strict/1763533622 -> viable/strict/1763533622 2025-12-04T08:56:59.9352906Z * [new tag] viable/strict/1763538576 -> viable/strict/1763538576 2025-12-04T08:56:59.9354342Z * [new tag] viable/strict/1763545823 -> viable/strict/1763545823 2025-12-04T08:56:59.9355566Z * [new tag] viable/strict/1763547951 -> viable/strict/1763547951 2025-12-04T08:56:59.9356950Z * [new tag] viable/strict/1763551477 -> viable/strict/1763551477 2025-12-04T08:56:59.9358298Z * [new tag] viable/strict/1763552982 -> viable/strict/1763552982 2025-12-04T08:56:59.9359654Z * [new tag] viable/strict/1763594698 -> viable/strict/1763594698 2025-12-04T08:56:59.9361042Z * [new tag] viable/strict/1763596178 -> viable/strict/1763596178 2025-12-04T08:56:59.9362415Z * [new tag] viable/strict/1763599155 -> viable/strict/1763599155 2025-12-04T08:56:59.9363969Z * [new tag] viable/strict/1763603717 -> viable/strict/1763603717 2025-12-04T08:56:59.9365366Z * [new tag] viable/strict/1763606923 -> viable/strict/1763606923 2025-12-04T08:56:59.9366686Z * [new tag] viable/strict/1763609715 -> viable/strict/1763609715 2025-12-04T08:56:59.9368018Z * [new tag] viable/strict/1763612757 -> viable/strict/1763612757 2025-12-04T08:56:59.9369324Z * [new tag] viable/strict/1763616325 -> viable/strict/1763616325 2025-12-04T08:56:59.9370692Z * [new tag] viable/strict/1763623509 -> viable/strict/1763623509 2025-12-04T08:56:59.9372160Z * [new tag] viable/strict/1763624984 -> viable/strict/1763624984 2025-12-04T08:56:59.9373617Z * [new tag] viable/strict/1763628796 -> viable/strict/1763628796 2025-12-04T08:56:59.9375023Z * [new tag] viable/strict/1763634343 -> viable/strict/1763634343 2025-12-04T08:56:59.9376251Z * [new tag] viable/strict/1763635867 -> viable/strict/1763635867 2025-12-04T08:56:59.9377746Z * [new tag] viable/strict/1763639382 -> viable/strict/1763639382 2025-12-04T08:56:59.9379036Z * [new tag] viable/strict/1763646626 -> viable/strict/1763646626 2025-12-04T08:56:59.9380522Z * [new tag] viable/strict/1763655997 -> viable/strict/1763655997 2025-12-04T08:56:59.9381867Z * [new tag] viable/strict/1763659444 -> viable/strict/1763659444 2025-12-04T08:56:59.9383372Z * [new tag] viable/strict/1763660992 -> viable/strict/1763660992 2025-12-04T08:56:59.9384665Z * [new tag] viable/strict/1763663201 -> viable/strict/1763663201 2025-12-04T08:56:59.9386028Z * [new tag] viable/strict/1763670362 -> viable/strict/1763670362 2025-12-04T08:56:59.9387321Z * [new tag] viable/strict/1763675378 -> viable/strict/1763675378 2025-12-04T08:56:59.9388592Z * [new tag] viable/strict/1763693343 -> viable/strict/1763693343 2025-12-04T08:56:59.9389941Z * [new tag] viable/strict/1763696088 -> viable/strict/1763696088 2025-12-04T08:56:59.9391378Z * [new tag] viable/strict/1763697343 -> viable/strict/1763697343 2025-12-04T08:56:59.9392748Z * [new tag] viable/strict/1763699165 -> viable/strict/1763699165 2025-12-04T08:56:59.9394086Z * [new tag] viable/strict/1763700660 -> viable/strict/1763700660 2025-12-04T08:56:59.9395384Z * [new tag] viable/strict/1763704209 -> viable/strict/1763704209 2025-12-04T08:56:59.9396760Z * [new tag] viable/strict/1763706411 -> viable/strict/1763706411 2025-12-04T08:56:59.9398106Z * [new tag] viable/strict/1763708082 -> viable/strict/1763708082 2025-12-04T08:56:59.9399341Z * [new tag] viable/strict/1763711381 -> viable/strict/1763711381 2025-12-04T08:56:59.9400581Z * [new tag] viable/strict/1763713593 -> viable/strict/1763713593 2025-12-04T08:56:59.9401947Z * [new tag] viable/strict/1763715201 -> viable/strict/1763715201 2025-12-04T08:56:59.9403311Z * [new tag] viable/strict/1763733017 -> viable/strict/1763733017 2025-12-04T08:56:59.9404594Z * [new tag] viable/strict/1763735108 -> viable/strict/1763735108 2025-12-04T08:56:59.9405898Z * [new tag] viable/strict/1763749579 -> viable/strict/1763749579 2025-12-04T08:56:59.9407283Z * [new tag] viable/strict/1763751113 -> viable/strict/1763751113 2025-12-04T08:56:59.9408717Z * [new tag] viable/strict/1763753035 -> viable/strict/1763753035 2025-12-04T08:56:59.9410021Z * [new tag] viable/strict/1763754578 -> viable/strict/1763754578 2025-12-04T08:56:59.9411468Z * [new tag] viable/strict/1763756748 -> viable/strict/1763756748 2025-12-04T08:56:59.9412903Z * [new tag] viable/strict/1763758205 -> viable/strict/1763758205 2025-12-04T08:56:59.9414201Z * [new tag] viable/strict/1763764050 -> viable/strict/1763764050 2025-12-04T08:56:59.9415505Z * [new tag] viable/strict/1763771887 -> viable/strict/1763771887 2025-12-04T08:56:59.9416941Z * [new tag] viable/strict/1763773920 -> viable/strict/1763773920 2025-12-04T08:56:59.9418258Z * [new tag] viable/strict/1763776501 -> viable/strict/1763776501 2025-12-04T08:56:59.9419973Z * [new tag] viable/strict/1763779437 -> viable/strict/1763779437 2025-12-04T08:56:59.9421497Z * [new tag] viable/strict/1763781038 -> viable/strict/1763781038 2025-12-04T08:56:59.9422703Z * [new tag] viable/strict/1763782245 -> viable/strict/1763782245 2025-12-04T08:56:59.9424138Z * [new tag] viable/strict/1763785568 -> viable/strict/1763785568 2025-12-04T08:56:59.9425451Z * [new tag] viable/strict/1763787006 -> viable/strict/1763787006 2025-12-04T08:56:59.9426893Z * [new tag] viable/strict/1763789103 -> viable/strict/1763789103 2025-12-04T08:56:59.9428272Z * [new tag] viable/strict/1763790578 -> viable/strict/1763790578 2025-12-04T08:56:59.9429590Z * [new tag] viable/strict/1763796275 -> viable/strict/1763796275 2025-12-04T08:56:59.9431183Z * [new tag] viable/strict/1763801465 -> viable/strict/1763801465 2025-12-04T08:56:59.9432501Z * [new tag] viable/strict/1763803522 -> viable/strict/1763803522 2025-12-04T08:56:59.9433819Z * [new tag] viable/strict/1763808581 -> viable/strict/1763808581 2025-12-04T08:56:59.9435138Z * [new tag] viable/strict/1763840977 -> viable/strict/1763840977 2025-12-04T08:56:59.9436450Z * [new tag] viable/strict/1763846659 -> viable/strict/1763846659 2025-12-04T08:56:59.9437774Z * [new tag] viable/strict/1763872065 -> viable/strict/1763872065 2025-12-04T08:56:59.9439203Z * [new tag] viable/strict/1763873648 -> viable/strict/1763873648 2025-12-04T08:56:59.9440534Z * [new tag] viable/strict/1763875506 -> viable/strict/1763875506 2025-12-04T08:56:59.9441768Z * [new tag] viable/strict/1763889904 -> viable/strict/1763889904 2025-12-04T08:56:59.9443105Z * [new tag] viable/strict/1763930999 -> viable/strict/1763930999 2025-12-04T08:56:59.9444445Z * [new tag] viable/strict/1763944964 -> viable/strict/1763944964 2025-12-04T08:56:59.9445627Z * [new tag] viable/strict/1763958474 -> viable/strict/1763958474 2025-12-04T08:56:59.9446975Z * [new tag] viable/strict/1763967263 -> viable/strict/1763967263 2025-12-04T08:56:59.9448339Z * [new tag] viable/strict/1763972803 -> viable/strict/1763972803 2025-12-04T08:56:59.9449620Z * [new tag] viable/strict/1763976376 -> viable/strict/1763976376 2025-12-04T08:56:59.9450955Z * [new tag] viable/strict/1763989404 -> viable/strict/1763989404 2025-12-04T08:56:59.9452244Z * [new tag] viable/strict/1763990887 -> viable/strict/1763990887 2025-12-04T08:56:59.9453709Z * [new tag] viable/strict/1764019919 -> viable/strict/1764019919 2025-12-04T08:56:59.9455048Z * [new tag] viable/strict/1764023134 -> viable/strict/1764023134 2025-12-04T08:56:59.9456248Z * [new tag] viable/strict/1764024593 -> viable/strict/1764024593 2025-12-04T08:56:59.9457619Z * [new tag] viable/strict/1764026706 -> viable/strict/1764026706 2025-12-04T08:56:59.9459113Z * [new tag] viable/strict/1764031139 -> viable/strict/1764031139 2025-12-04T08:56:59.9460505Z * [new tag] viable/strict/1764033131 -> viable/strict/1764033131 2025-12-04T08:56:59.9461903Z * [new tag] viable/strict/1764035725 -> viable/strict/1764035725 2025-12-04T08:56:59.9463209Z * [new tag] viable/strict/1764624265 -> viable/strict/1764624265 2025-12-04T08:56:59.9464403Z * [new tag] viable/strict/1764631514 -> viable/strict/1764631514 2025-12-04T08:56:59.9465600Z * [new tag] viable/strict/1764632987 -> viable/strict/1764632987 2025-12-04T08:56:59.9466857Z * [new tag] viable/strict/1764636063 -> viable/strict/1764636063 2025-12-04T08:56:59.9468020Z * [new tag] viable/strict/1764643975 -> viable/strict/1764643975 2025-12-04T08:56:59.9469267Z * [new tag] viable/strict/1764646859 -> viable/strict/1764646859 2025-12-04T08:56:59.9470441Z * [new tag] viable/strict/1764653120 -> viable/strict/1764653120 2025-12-04T08:56:59.9471844Z * [new tag] viable/strict/1764654632 -> viable/strict/1764654632 2025-12-04T08:56:59.9472771Z * [new tag] viable/strict/1764656821 -> viable/strict/1764656821 2025-12-04T08:56:59.9474055Z * [new tag] viable/strict/1764658557 -> viable/strict/1764658557 2025-12-04T08:56:59.9475233Z * [new tag] viable/strict/1764660333 -> viable/strict/1764660333 2025-12-04T08:56:59.9476424Z * [new tag] viable/strict/1764661812 -> viable/strict/1764661812 2025-12-04T08:56:59.9477614Z * [new tag] viable/strict/1764664023 -> viable/strict/1764664023 2025-12-04T08:56:59.9478773Z * [new tag] viable/strict/1764669150 -> viable/strict/1764669150 2025-12-04T08:56:59.9479955Z * [new tag] viable/strict/1764680709 -> viable/strict/1764680709 2025-12-04T08:56:59.9481223Z * [new tag] viable/strict/1764687619 -> viable/strict/1764687619 2025-12-04T08:56:59.9482433Z * [new tag] viable/strict/1764696355 -> viable/strict/1764696355 2025-12-04T08:56:59.9483616Z * [new tag] viable/strict/1764701767 -> viable/strict/1764701767 2025-12-04T08:56:59.9484809Z * [new tag] viable/strict/1764710768 -> viable/strict/1764710768 2025-12-04T08:56:59.9486035Z * [new tag] viable/strict/1764716202 -> viable/strict/1764716202 2025-12-04T08:56:59.9487200Z * [new tag] viable/strict/1764793566 -> viable/strict/1764793566 2025-12-04T08:56:59.9488384Z * [new tag] viable/strict/1764797093 -> viable/strict/1764797093 2025-12-04T08:56:59.9489584Z * [new tag] viable/strict/1764800729 -> viable/strict/1764800729 2025-12-04T08:56:59.9490886Z * [new tag] whc_flight_1 -> whc_flight_1 2025-12-04T08:56:59.9492136Z * [new tag] whc_flight_2 -> whc_flight_2 2025-12-04T08:56:59.9493702Z * [new tag] whc_flight_4 -> whc_flight_4 2025-12-04T08:57:00.0330150Z [command]/usr/bin/git rev-parse --verify --quiet ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32^{object} 2025-12-04T08:57:00.0356205Z ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T08:57:00.0360738Z ##[endgroup] 2025-12-04T08:57:00.0361021Z ##[group]Determining the checkout info 2025-12-04T08:57:00.0362156Z ##[endgroup] 2025-12-04T08:57:00.0365672Z [command]/usr/bin/git sparse-checkout disable 2025-12-04T08:57:00.0398798Z [command]/usr/bin/git config --local --unset-all extensions.worktreeConfig 2025-12-04T08:57:00.0425977Z ##[group]Checking out the ref 2025-12-04T08:57:00.0429635Z [command]/usr/bin/git checkout --progress --force ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T08:57:01.0740160Z Updating files: 75% (15241/20121) 2025-12-04T08:57:01.0893767Z Updating files: 76% (15292/20121) 2025-12-04T08:57:01.1039133Z Updating files: 77% (15494/20121) 2025-12-04T08:57:01.1244276Z Updating files: 78% (15695/20121) 2025-12-04T08:57:01.1500082Z Updating files: 79% (15896/20121) 2025-12-04T08:57:01.1804141Z Updating files: 80% (16097/20121) 2025-12-04T08:57:01.2079748Z Updating files: 81% (16299/20121) 2025-12-04T08:57:01.2298411Z Updating files: 82% (16500/20121) 2025-12-04T08:57:01.2462554Z Updating files: 83% (16701/20121) 2025-12-04T08:57:01.2613697Z Updating files: 84% (16902/20121) 2025-12-04T08:57:01.2789303Z Updating files: 85% (17103/20121) 2025-12-04T08:57:01.2958130Z Updating files: 86% (17305/20121) 2025-12-04T08:57:01.3111871Z Updating files: 87% (17506/20121) 2025-12-04T08:57:01.3240866Z Updating files: 88% (17707/20121) 2025-12-04T08:57:01.3391244Z Updating files: 89% (17908/20121) 2025-12-04T08:57:01.3572643Z Updating files: 90% (18109/20121) 2025-12-04T08:57:01.3705433Z Updating files: 91% (18311/20121) 2025-12-04T08:57:01.3871628Z Updating files: 92% (18512/20121) 2025-12-04T08:57:01.4060974Z Updating files: 93% (18713/20121) 2025-12-04T08:57:01.4268813Z Updating files: 94% (18914/20121) 2025-12-04T08:57:01.4451575Z Updating files: 95% (19115/20121) 2025-12-04T08:57:01.4620988Z Updating files: 96% (19317/20121) 2025-12-04T08:57:01.4794600Z Updating files: 97% (19518/20121) 2025-12-04T08:57:01.5072227Z Updating files: 98% (19719/20121) 2025-12-04T08:57:01.5255509Z Updating files: 99% (19920/20121) 2025-12-04T08:57:01.5255913Z Updating files: 100% (20121/20121) 2025-12-04T08:57:01.5256284Z Updating files: 100% (20121/20121), done. 2025-12-04T08:57:01.5491932Z Note: switching to 'ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32'. 2025-12-04T08:57:01.5492435Z 2025-12-04T08:57:01.5492644Z You are in 'detached HEAD' state. You can look around, make experimental 2025-12-04T08:57:01.5493140Z changes and commit them, and you can discard any commits you make in this 2025-12-04T08:57:01.5493639Z state without impacting any branches by switching back to a branch. 2025-12-04T08:57:01.5493931Z 2025-12-04T08:57:01.5494116Z If you want to create a new branch to retain commits you create, you may 2025-12-04T08:57:01.5494592Z do so (now or later) by using -c with the switch command. Example: 2025-12-04T08:57:01.5494862Z 2025-12-04T08:57:01.5494975Z git switch -c 2025-12-04T08:57:01.5495157Z 2025-12-04T08:57:01.5495257Z Or undo this operation with: 2025-12-04T08:57:01.5495427Z 2025-12-04T08:57:01.5495508Z git switch - 2025-12-04T08:57:01.5495630Z 2025-12-04T08:57:01.5495852Z Turn off this advice by setting config variable advice.detachedHead to false 2025-12-04T08:57:01.5496162Z 2025-12-04T08:57:01.5496399Z HEAD is now at ffd9b0fb435 Resolve collective autotuning test failure on arm (#168919) 2025-12-04T08:57:01.5618610Z ##[endgroup] 2025-12-04T08:57:01.5619795Z ##[group]Setting up auth for fetching submodules 2025-12-04T08:57:01.5627625Z [command]/usr/bin/git config --global http.https://github.com/.extraheader AUTHORIZATION: basic *** 2025-12-04T08:57:01.5677045Z [command]/usr/bin/git config --global --unset-all url.https://github.com/.insteadOf 2025-12-04T08:57:01.5707645Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf git@github.com: 2025-12-04T08:57:01.5736424Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf org-21003710@github.com: 2025-12-04T08:57:01.5760990Z ##[endgroup] 2025-12-04T08:57:01.5761669Z ##[group]Fetching submodules 2025-12-04T08:57:01.5765820Z [command]/usr/bin/git submodule sync --recursive 2025-12-04T08:57:01.6104412Z [command]/usr/bin/git -c protocol.version=2 submodule update --init --force --recursive 2025-12-04T08:57:01.6431708Z Submodule 'android/libs/fbjni' (https://github.com/facebookincubator/fbjni.git) registered for path 'android/libs/fbjni' 2025-12-04T08:57:01.6434277Z Submodule 'third_party/NNPACK_deps/FP16' (https://github.com/Maratyszcza/FP16.git) registered for path 'third_party/FP16' 2025-12-04T08:57:01.6437475Z Submodule 'third_party/NNPACK_deps/FXdiv' (https://github.com/Maratyszcza/FXdiv.git) registered for path 'third_party/FXdiv' 2025-12-04T08:57:01.6440764Z Submodule 'third_party/NNPACK' (https://github.com/Maratyszcza/NNPACK.git) registered for path 'third_party/NNPACK' 2025-12-04T08:57:01.6443869Z Submodule 'third_party/NVTX' (https://github.com/NVIDIA/NVTX.git) registered for path 'third_party/NVTX' 2025-12-04T08:57:01.6447520Z Submodule 'third_party/VulkanMemoryAllocator' (https://github.com/GPUOpen-LibrariesAndSDKs/VulkanMemoryAllocator.git) registered for path 'third_party/VulkanMemoryAllocator' 2025-12-04T08:57:01.6450843Z Submodule 'third_party/XNNPACK' (https://github.com/google/XNNPACK.git) registered for path 'third_party/XNNPACK' 2025-12-04T08:57:01.6454487Z Submodule 'third_party/aiter' (https://github.com/ROCm/aiter.git) registered for path 'third_party/aiter' 2025-12-04T08:57:01.6458112Z Submodule 'third_party/benchmark' (https://github.com/google/benchmark.git) registered for path 'third_party/benchmark' 2025-12-04T08:57:01.6462234Z Submodule 'third_party/composable_kernel' (https://github.com/ROCm/composable_kernel.git) registered for path 'third_party/composable_kernel' 2025-12-04T08:57:01.6466568Z Submodule 'third_party/cpp-httplib' (https://github.com/yhirose/cpp-httplib.git) registered for path 'third_party/cpp-httplib' 2025-12-04T08:57:01.6470142Z Submodule 'third_party/cpuinfo' (https://github.com/pytorch/cpuinfo.git) registered for path 'third_party/cpuinfo' 2025-12-04T08:57:01.6473954Z Submodule 'third_party/cudnn_frontend' (https://github.com/NVIDIA/cudnn-frontend.git) registered for path 'third_party/cudnn_frontend' 2025-12-04T08:57:01.6477942Z Submodule 'third_party/cutlass' (https://github.com/NVIDIA/cutlass.git) registered for path 'third_party/cutlass' 2025-12-04T08:57:01.6482016Z Submodule 'third_party/fbgemm' (https://github.com/pytorch/fbgemm) registered for path 'third_party/fbgemm' 2025-12-04T08:57:01.6487307Z Submodule 'third_party/flash-attention' (https://github.com/Dao-AILab/flash-attention.git) registered for path 'third_party/flash-attention' 2025-12-04T08:57:01.6494400Z Submodule 'third_party/flatbuffers' (https://github.com/google/flatbuffers.git) registered for path 'third_party/flatbuffers' 2025-12-04T08:57:01.6498642Z Submodule 'third_party/fmt' (https://github.com/fmtlib/fmt.git) registered for path 'third_party/fmt' 2025-12-04T08:57:01.6503176Z Submodule 'third_party/gemmlowp/gemmlowp' (https://github.com/google/gemmlowp.git) registered for path 'third_party/gemmlowp/gemmlowp' 2025-12-04T08:57:01.6507529Z Submodule 'third_party/gloo' (https://github.com/pytorch/gloo) registered for path 'third_party/gloo' 2025-12-04T08:57:01.6512159Z Submodule 'third_party/googletest' (https://github.com/google/googletest.git) registered for path 'third_party/googletest' 2025-12-04T08:57:01.6516666Z Submodule 'third_party/ideep' (https://github.com/intel/ideep) registered for path 'third_party/ideep' 2025-12-04T08:57:01.6521453Z Submodule 'third_party/ittapi' (https://github.com/intel/ittapi.git) registered for path 'third_party/ittapi' 2025-12-04T08:57:01.6526082Z Submodule 'third_party/kineto' (https://github.com/pytorch/kineto) registered for path 'third_party/kineto' 2025-12-04T08:57:01.6530969Z Submodule 'third_party/kleidiai' (https://github.com/ARM-software/kleidiai.git) registered for path 'third_party/kleidiai' 2025-12-04T08:57:01.6536046Z Submodule 'third_party/mimalloc' (https://github.com/microsoft/mimalloc.git) registered for path 'third_party/mimalloc' 2025-12-04T08:57:01.6541017Z Submodule 'third_party/nlohmann' (https://github.com/nlohmann/json.git) registered for path 'third_party/nlohmann' 2025-12-04T08:57:01.6545941Z Submodule 'third_party/onnx' (https://github.com/onnx/onnx.git) registered for path 'third_party/onnx' 2025-12-04T08:57:01.6551387Z Submodule 'third_party/opentelemetry-cpp' (https://github.com/open-telemetry/opentelemetry-cpp.git) registered for path 'third_party/opentelemetry-cpp' 2025-12-04T08:57:01.6556437Z Submodule 'third_party/pocketfft' (https://github.com/mreineck/pocketfft) registered for path 'third_party/pocketfft' 2025-12-04T08:57:01.6562010Z Submodule 'third_party/protobuf' (https://github.com/protocolbuffers/protobuf.git) registered for path 'third_party/protobuf' 2025-12-04T08:57:01.6567439Z Submodule 'third_party/NNPACK_deps/psimd' (https://github.com/Maratyszcza/psimd.git) registered for path 'third_party/psimd' 2025-12-04T08:57:01.6573072Z Submodule 'third_party/NNPACK_deps/pthreadpool' (https://github.com/Maratyszcza/pthreadpool.git) registered for path 'third_party/pthreadpool' 2025-12-04T08:57:01.6581298Z Submodule 'third_party/pybind11' (https://github.com/pybind/pybind11.git) registered for path 'third_party/pybind11' 2025-12-04T08:57:01.6586995Z Submodule 'third_party/python-peachpy' (https://github.com/malfet/PeachPy.git) registered for path 'third_party/python-peachpy' 2025-12-04T08:57:01.6593047Z Submodule 'third_party/sleef' (https://github.com/shibatch/sleef) registered for path 'third_party/sleef' 2025-12-04T08:57:01.6599253Z Submodule 'third_party/tensorpipe' (https://github.com/pytorch/tensorpipe.git) registered for path 'third_party/tensorpipe' 2025-12-04T08:57:01.6629561Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/android/libs/fbjni'... 2025-12-04T08:57:01.9142545Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/FXdiv'... 2025-12-04T08:57:01.9143891Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/FP16'... 2025-12-04T08:57:01.9166299Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/fmt'... 2025-12-04T08:57:04.4765834Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/NNPACK'... 2025-12-04T08:57:04.4767435Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/benchmark'... 2025-12-04T08:57:04.4768604Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/NVTX'... 2025-12-04T08:57:04.4769867Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/gemmlowp/gemmlowp'... 2025-12-04T08:57:04.4771000Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/cpuinfo'... 2025-12-04T08:57:04.4772171Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/flash-attention'... 2025-12-04T08:57:04.4773415Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/gloo'... 2025-12-04T08:57:04.4774469Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/cpp-httplib'... 2025-12-04T08:57:04.4775336Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/ideep'... 2025-12-04T08:57:04.4776168Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/ittapi'... 2025-12-04T08:57:04.4777021Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kleidiai'... 2025-12-04T08:57:04.4777883Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/pocketfft'... 2025-12-04T08:57:04.4778796Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/cudnn_frontend'... 2025-12-04T08:57:04.4779665Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/psimd'... 2025-12-04T08:57:04.4780539Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/googletest'... 2025-12-04T08:57:04.4781462Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/pthreadpool'... 2025-12-04T08:57:04.4855615Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/flatbuffers'... 2025-12-04T08:57:04.7266559Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/mimalloc'... 2025-12-04T08:57:04.7267871Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/VulkanMemoryAllocator'... 2025-12-04T08:57:04.7269024Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/python-peachpy'... 2025-12-04T08:57:04.8267651Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/tensorpipe'... 2025-12-04T08:57:05.2064138Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/onnx'... 2025-12-04T08:57:07.2665680Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto'... 2025-12-04T08:57:07.2667131Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/sleef'... 2025-12-04T08:57:07.2668485Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/pybind11'... 2025-12-04T08:57:07.2669867Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/cutlass'... 2025-12-04T08:57:07.2760782Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/fbgemm'... 2025-12-04T08:57:07.3666597Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/XNNPACK'... 2025-12-04T08:57:22.9341782Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/composable_kernel'... 2025-12-04T08:57:22.9342564Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/nlohmann'... 2025-12-04T08:57:22.9343240Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/opentelemetry-cpp'... 2025-12-04T08:57:22.9344205Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/aiter'... 2025-12-04T08:57:22.9345079Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/protobuf'... 2025-12-04T08:57:22.9536005Z Submodule path 'android/libs/fbjni': checked out '7e1e1fe3858c63c251c637ae41a20de425dde96f' 2025-12-04T08:57:22.9684077Z Submodule path 'third_party/FP16': checked out '4dfe081cf6bcd15db339cf2680b9281b8451eeb3' 2025-12-04T08:57:22.9796799Z Submodule path 'third_party/FXdiv': checked out 'b408327ac2a15ec3e43352421954f5b1967701d1' 2025-12-04T08:57:23.0087994Z Submodule path 'third_party/NNPACK': checked out 'c07e3a0400713d546e0dea2d5466dd22ea389c73' 2025-12-04T08:57:23.0990242Z Submodule path 'third_party/NVTX': checked out '3ebbc93ded7285963bff932c678fa367eb393ba6' 2025-12-04T08:57:23.1563194Z Submodule path 'third_party/VulkanMemoryAllocator': checked out '1d8f600fd424278486eade7ed3e877c99f0846b1' 2025-12-04T08:57:24.0343816Z Submodule path 'third_party/XNNPACK': checked out '51a0103656eff6fc9bfd39a4597923c4b542c883' 2025-12-04T08:57:24.2264233Z Submodule path 'third_party/aiter': checked out '01aae101b9e5e94d6c16a9514c9fb8df99c93150' 2025-12-04T08:57:24.2286564Z Submodule '3rdparty/composable_kernel' (https://github.com/ROCm/composable_kernel.git) registered for path 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T08:57:24.2318841Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/aiter/3rdparty/composable_kernel'... 2025-12-04T08:57:28.6397477Z Submodule path 'third_party/aiter/3rdparty/composable_kernel': checked out 'cffe8fa2a442ac8e80dd236a1a5d24fe3d7e0cbf' 2025-12-04T08:57:28.6670265Z Submodule path 'third_party/benchmark': checked out '299e5928955cc62af9968370293b916f5130916f' 2025-12-04T08:57:29.0757708Z Submodule path 'third_party/composable_kernel': checked out '7fe50dc3da2069d6645d9deb8c017a876472a977' 2025-12-04T08:57:29.1293740Z Submodule path 'third_party/cpp-httplib': checked out '89c932f313c6437c38f2982869beacc89c2f2246' 2025-12-04T08:57:29.2281661Z Submodule path 'third_party/cpuinfo': checked out 'f858c30bcb16f8effd5ff46996f0514539e17abc' 2025-12-04T08:57:29.2790722Z Submodule path 'third_party/cudnn_frontend': checked out '0b1577c8c83401237d601d0d0db5210506705396' 2025-12-04T08:57:29.9929226Z Submodule path 'third_party/cutlass': checked out 'f88806b1e31dfa579842638740216dd41fc6c588' 2025-12-04T08:57:30.1681804Z Submodule path 'third_party/fbgemm': checked out 'c0b988d39a9e47c794d699f29930ed4d7c7e13a4' 2025-12-04T08:57:30.1709479Z Submodule 'external/asmjit' (https://github.com/asmjit/asmjit.git) registered for path 'third_party/fbgemm/external/asmjit' 2025-12-04T08:57:30.1712729Z Submodule 'external/composable_kernel' (https://github.com/ROCm/composable_kernel.git) registered for path 'third_party/fbgemm/external/composable_kernel' 2025-12-04T08:57:30.1716029Z Submodule 'external/cpuinfo' (https://github.com/pytorch/cpuinfo) registered for path 'third_party/fbgemm/external/cpuinfo' 2025-12-04T08:57:30.1719837Z Submodule 'external/cutlass' (https://github.com/jwfromm/cutlass) registered for path 'third_party/fbgemm/external/cutlass' 2025-12-04T08:57:30.1723394Z Submodule 'external/googletest' (https://github.com/google/googletest) registered for path 'third_party/fbgemm/external/googletest' 2025-12-04T08:57:30.1727038Z Submodule 'external/hipify_torch' (https://github.com/ROCmSoftwarePlatform/hipify_torch.git) registered for path 'third_party/fbgemm/external/hipify_torch' 2025-12-04T08:57:30.1730603Z Submodule 'external/json' (https://github.com/nlohmann/json.git) registered for path 'third_party/fbgemm/external/json' 2025-12-04T08:57:30.1763657Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/fbgemm/external/asmjit'... 2025-12-04T08:57:31.3061431Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/fbgemm/external/hipify_torch'... 2025-12-04T08:57:31.3062942Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/fbgemm/external/cpuinfo'... 2025-12-04T08:57:31.3063959Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/fbgemm/external/googletest'... 2025-12-04T08:57:31.4062075Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/fbgemm/external/composable_kernel'... 2025-12-04T08:57:34.4149219Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/fbgemm/external/cutlass'... 2025-12-04T08:57:34.5151826Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/fbgemm/external/json'... 2025-12-04T08:57:36.7045887Z Submodule path 'third_party/fbgemm/external/asmjit': checked out 'a3199e8857792cd10b7589ff5d58343d2c9008ea' 2025-12-04T08:57:37.1132784Z Submodule path 'third_party/fbgemm/external/composable_kernel': checked out '7fe50dc3da2069d6645d9deb8c017a876472a977' 2025-12-04T08:57:37.2158798Z Submodule path 'third_party/fbgemm/external/cpuinfo': checked out '6543fec09b2f04ac4a666882998b534afc9c1349' 2025-12-04T08:57:37.9139231Z Submodule path 'third_party/fbgemm/external/cutlass': checked out '98125ce499b0fdf7ffbe0e3052f5b8709f4840f8' 2025-12-04T08:57:37.9630344Z Submodule path 'third_party/fbgemm/external/googletest': checked out '52eb8108c5bdec04579160ae17225d66034bd723' 2025-12-04T08:57:37.9768773Z Submodule path 'third_party/fbgemm/external/hipify_torch': checked out '63b6a7b541fa7f08f8475ca7d74054db36ff2691' 2025-12-04T08:57:38.0919451Z Submodule path 'third_party/fbgemm/external/json': checked out '9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03' 2025-12-04T08:57:38.1732527Z Submodule path 'third_party/flash-attention': checked out '979702c87a8713a8e0a5e9fee122b90d2ef13be5' 2025-12-04T08:57:38.1757995Z Submodule 'csrc/composable_kernel' (https://github.com/ROCm/composable_kernel.git) registered for path 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T08:57:38.1761095Z Submodule 'csrc/cutlass' (https://github.com/NVIDIA/cutlass.git) registered for path 'third_party/flash-attention/csrc/cutlass' 2025-12-04T08:57:38.1790967Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/flash-attention/csrc/composable_kernel'... 2025-12-04T08:57:42.0515869Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/flash-attention/csrc/cutlass'... 2025-12-04T08:57:42.3465100Z Submodule path 'third_party/flash-attention/csrc/composable_kernel': checked out '888317e698e9803c62bd38568abc9e05d7709f33' 2025-12-04T08:57:42.9687752Z Submodule path 'third_party/flash-attention/csrc/cutlass': checked out 'c506e16788cb08416a4a57e11a9067beeee29420' 2025-12-04T08:57:43.1281142Z Submodule path 'third_party/flatbuffers': checked out 'a2cd1ea3b6d3fee220106b5fed3f7ce8da9eb757' 2025-12-04T08:57:43.1595151Z Submodule path 'third_party/fmt': checked out '407c905e45ad75fc29bf0f9bb7c5c2fd3475976f' 2025-12-04T08:57:43.2010535Z Submodule path 'third_party/gemmlowp/gemmlowp': checked out '3fb5c176c17c765a3492cd2f0321b0dab712f350' 2025-12-04T08:57:43.2302391Z Submodule path 'third_party/gloo': checked out '54cbae0d3a67fa890b4c3d9ee162b7860315e341' 2025-12-04T08:57:43.2776146Z Submodule path 'third_party/googletest': checked out '52eb8108c5bdec04579160ae17225d66034bd723' 2025-12-04T08:57:43.2928795Z Submodule path 'third_party/ideep': checked out '719d8e6cd7f7a0e01b155657526d693acf97c2b3' 2025-12-04T08:57:43.2947650Z Submodule 'mkl-dnn' (https://github.com/intel/mkl-dnn.git) registered for path 'third_party/ideep/mkl-dnn' 2025-12-04T08:57:43.2977756Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/ideep/mkl-dnn'... 2025-12-04T08:57:57.4965332Z Submodule path 'third_party/ideep/mkl-dnn': checked out '8d263e693366ef8db40acc569cc7d8edf644556d' 2025-12-04T08:57:57.5198236Z Submodule path 'third_party/ittapi': checked out 'dec1d23ca65ab069d225dfe40dea14f455170959' 2025-12-04T08:57:57.6055611Z Submodule path 'third_party/kineto': checked out '31f85df8fbd89c188f14ef10f1ec65379786b943' 2025-12-04T08:57:57.6077798Z Submodule 'libkineto/third_party/dynolog' (https://github.com/facebookincubator/dynolog.git) registered for path 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T08:57:57.6080633Z Submodule 'libkineto/third_party/fmt' (https://github.com/fmtlib/fmt.git) registered for path 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T08:57:57.6084242Z Submodule 'libkineto/third_party/googletest' (https://github.com/google/googletest.git) registered for path 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T08:57:57.6116309Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/dynolog'... 2025-12-04T08:57:58.2806101Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/fmt'... 2025-12-04T08:57:58.7121921Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/googletest'... 2025-12-04T08:57:58.8055025Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog': checked out 'd2ffe0a4e3acace628db49974246b66fc3e85fb1' 2025-12-04T08:57:58.8075452Z Submodule 'third_party/DCGM' (https://github.com/NVIDIA/DCGM.git) registered for path 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T08:57:58.8078612Z Submodule 'third_party/cpr' (https://github.com/libcpr/cpr.git) registered for path 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T08:57:58.8081945Z Submodule 'third_party/fmt' (https://github.com/fmtlib/fmt.git) registered for path 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T08:57:58.8085527Z Submodule 'third_party/gflags' (https://github.com/gflags/gflags.git) registered for path 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T08:57:58.8089123Z Submodule 'third_party/glog' (https://github.com/google/glog.git) registered for path 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T08:57:58.8092977Z Submodule 'third_party/googletest' (https://github.com/google/googletest.git) registered for path 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T08:57:58.8096889Z Submodule 'third_party/json' (https://github.com/nlohmann/json.git) registered for path 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T08:57:58.8100645Z Submodule 'third_party/pfs' (https://github.com/dtrugman/pfs.git) registered for path 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T08:57:58.8105196Z Submodule 'third_party/prometheus-cpp' (https://github.com/jupp0r/prometheus-cpp.git) registered for path 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T08:57:58.8137181Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM'... 2025-12-04T08:58:00.6322405Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/dynolog/third_party/pfs'... 2025-12-04T08:58:00.6323483Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/dynolog/third_party/gflags'... 2025-12-04T08:58:00.6324526Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp'... 2025-12-04T08:58:00.6325522Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/dynolog/third_party/cpr'... 2025-12-04T08:58:00.6326479Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/dynolog/third_party/glog'... 2025-12-04T08:58:00.6327478Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/dynolog/third_party/googletest'... 2025-12-04T08:58:00.6328622Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/dynolog/third_party/fmt'... 2025-12-04T08:58:00.7323264Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/dynolog/third_party/json'... 2025-12-04T08:58:05.2149874Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM': checked out 'ffde4e54bc7249a6039a5e6b45b395141e1217f9' 2025-12-04T08:58:05.2369607Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr': checked out '871ed52d350214a034f6ef8a3b8f51c5ce1bd400' 2025-12-04T08:58:05.2758942Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt': checked out 'cd4af11efc9c622896a3e4cb599fa28668ca3d05' 2025-12-04T08:58:05.2922616Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags': checked out 'e171aa2d15ed9eb17054558e0b3a6a413bb01067' 2025-12-04T08:58:05.2941694Z Submodule 'doc' (https://github.com/gflags/gflags.git) registered for path 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T08:58:05.2976171Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc'... 2025-12-04T08:58:05.5498616Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc': checked out '8411df715cf522606e3b1aca386ddfc0b63d34b4' 2025-12-04T08:58:05.5723734Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog': checked out 'b33e3bad4c46c8a6345525fd822af355e5ef9446' 2025-12-04T08:58:05.6197387Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest': checked out '52eb8108c5bdec04579160ae17225d66034bd723' 2025-12-04T08:58:05.7262182Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/json': checked out '4f8fba14066156b73f1189a2b8bd568bde5284c5' 2025-12-04T08:58:05.7459715Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs': checked out 'f68a2fa8ea36c783bdd760371411fcb495aa3150' 2025-12-04T08:58:05.7668458Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp': checked out 'b1234816facfdda29845c46696a02998a4af115a' 2025-12-04T08:58:05.7688441Z Submodule 'civetweb' (https://github.com/civetweb/civetweb.git) registered for path 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T08:58:05.7691944Z Submodule 'googletest' (https://github.com/google/googletest.git) registered for path 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T08:58:05.7721016Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb'... 2025-12-04T08:58:07.6736657Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest'... 2025-12-04T08:58:07.9210025Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb': checked out 'd7ba35bbb649209c66e582d5a0244ba988a15159' 2025-12-04T08:58:07.9703318Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest': checked out 'e2239ee6043f73722e7aa812a459f54a28552929' 2025-12-04T08:58:08.0043776Z Submodule path 'third_party/kineto/libkineto/third_party/fmt': checked out '40626af88bd7df9a5fb80be7b25ac85b122d6c21' 2025-12-04T08:58:08.0514387Z Submodule path 'third_party/kineto/libkineto/third_party/googletest': checked out '52eb8108c5bdec04579160ae17225d66034bd723' 2025-12-04T08:58:08.1123464Z Submodule path 'third_party/kleidiai': checked out 'd7770c89632329a9914ef1a90289917597639cbe' 2025-12-04T08:58:08.1550889Z Submodule path 'third_party/mimalloc': checked out 'fbd8b99c2b828428947d70fdc046bb55609be93e' 2025-12-04T08:58:08.2783044Z Submodule path 'third_party/nlohmann': checked out '55f93686c01528224f448c19128836e7df245f72' 2025-12-04T08:58:08.8041648Z Submodule path 'third_party/onnx': checked out 'e709452ef2bbc1d113faf678c24e6d3467696e83' 2025-12-04T08:58:08.8083591Z Submodule 'third_party/pybind11' (https://github.com/pybind/pybind11.git) registered for path 'third_party/onnx/third_party/pybind11' 2025-12-04T08:58:08.8115095Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/onnx/third_party/pybind11'... 2025-12-04T08:58:09.5740178Z Submodule path 'third_party/onnx/third_party/pybind11': checked out 'a2e59f0e7065404b44dfe92a28aca47ba1378dc4' 2025-12-04T08:58:09.6588070Z Submodule path 'third_party/opentelemetry-cpp': checked out 'a799f4aed9c94b765dcdaabaeab7d5e7e2310878' 2025-12-04T08:58:09.6609893Z Submodule 'third_party/benchmark' (https://github.com/google/benchmark) registered for path 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T08:58:09.6613324Z Submodule 'third_party/googletest' (https://github.com/google/googletest) registered for path 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T08:58:09.6616693Z Submodule 'third_party/ms-gsl' (https://github.com/microsoft/GSL) registered for path 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T08:58:09.6620175Z Submodule 'third_party/nlohmann-json' (https://github.com/nlohmann/json) registered for path 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T08:58:09.6623848Z Submodule 'third_party/opentelemetry-proto' (https://github.com/open-telemetry/opentelemetry-proto) registered for path 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T08:58:09.6627354Z Submodule 'third_party/opentracing-cpp' (https://github.com/opentracing/opentracing-cpp.git) registered for path 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T08:58:09.6631033Z Submodule 'third_party/prometheus-cpp' (https://github.com/jupp0r/prometheus-cpp) registered for path 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T08:58:09.6634934Z Submodule 'tools/vcpkg' (https://github.com/Microsoft/vcpkg) registered for path 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T08:58:09.6666488Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/opentelemetry-cpp/third_party/benchmark'... 2025-12-04T08:58:10.0612853Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/opentelemetry-cpp/third_party/opentracing-cpp'... 2025-12-04T08:58:10.0614113Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/opentelemetry-cpp/third_party/opentelemetry-proto'... 2025-12-04T08:58:10.0615390Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/opentelemetry-cpp/third_party/prometheus-cpp'... 2025-12-04T08:58:10.0616144Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/opentelemetry-cpp/third_party/ms-gsl'... 2025-12-04T08:58:10.1614117Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/opentelemetry-cpp/third_party/googletest'... 2025-12-04T08:58:10.6616107Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/opentelemetry-cpp/third_party/nlohmann-json'... 2025-12-04T08:58:17.4781475Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/opentelemetry-cpp/tools/vcpkg'... 2025-12-04T08:58:17.7217023Z Submodule path 'third_party/opentelemetry-cpp/third_party/benchmark': checked out 'd572f4777349d43653b21d6c2fc63020ab326db2' 2025-12-04T08:58:17.7647677Z Submodule path 'third_party/opentelemetry-cpp/third_party/googletest': checked out 'b796f7d44681514f58a683a3a71ff17c94edb0c1' 2025-12-04T08:58:17.7836751Z Submodule path 'third_party/opentelemetry-cpp/third_party/ms-gsl': checked out '6f4529395c5b7c2d661812257cd6780c67e54afa' 2025-12-04T08:58:17.8985101Z Submodule path 'third_party/opentelemetry-cpp/third_party/nlohmann-json': checked out 'bc889afb4c5bf1c0d8ee29ef35eaaf4c8bef8a5d' 2025-12-04T08:58:17.9145776Z Submodule path 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto': checked out '4ca4f0335c63cda7ab31ea7ed70d6553aee14dce' 2025-12-04T08:58:17.9325213Z Submodule path 'third_party/opentelemetry-cpp/third_party/opentracing-cpp': checked out '06b57f48ded1fa3bdd3d4346f6ef29e40e08eaf5' 2025-12-04T08:58:17.9523572Z Submodule path 'third_party/opentelemetry-cpp/third_party/prometheus-cpp': checked out 'c9ffcdda9086ffd9e1283ea7a0276d831f3c8a8d' 2025-12-04T08:58:17.9542865Z Submodule 'civetweb' (https://github.com/civetweb/civetweb.git) registered for path 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T08:58:17.9546059Z Submodule 'googletest' (https://github.com/google/googletest.git) registered for path 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T08:58:17.9576375Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb'... 2025-12-04T08:58:19.8592237Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest'... 2025-12-04T08:58:20.1036809Z Submodule path 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb': checked out 'eefb26f82b233268fc98577d265352720d477ba4' 2025-12-04T08:58:20.1522265Z Submodule path 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest': checked out 'e2239ee6043f73722e7aa812a459f54a28552929' 2025-12-04T08:58:20.7867915Z Submodule path 'third_party/opentelemetry-cpp/tools/vcpkg': checked out '8eb57355a4ffb410a2e94c07b4dca2dffbee8e50' 2025-12-04T08:58:20.8003244Z Submodule path 'third_party/pocketfft': checked out '0fa0ef591e38c2758e3184c6c23e497b9f732ffa' 2025-12-04T08:58:21.0906301Z Submodule path 'third_party/protobuf': checked out 'd1eca4e4b421cd2997495c4b4e65cea6be4e9b8a' 2025-12-04T08:58:21.0931415Z Submodule 'third_party/benchmark' (https://github.com/google/benchmark.git) registered for path 'third_party/protobuf/third_party/benchmark' 2025-12-04T08:58:21.0934797Z Submodule 'third_party/googletest' (https://github.com/google/googletest.git) registered for path 'third_party/protobuf/third_party/googletest' 2025-12-04T08:58:21.0967125Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/protobuf/third_party/benchmark'... 2025-12-04T08:58:21.6201185Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/protobuf/third_party/googletest'... 2025-12-04T08:58:21.9125670Z Submodule path 'third_party/protobuf/third_party/benchmark': checked out '5b7683f49e1e9223cf9927b24f6fd3d6bd82e3f8' 2025-12-04T08:58:21.9840921Z Submodule path 'third_party/protobuf/third_party/googletest': checked out '5ec7f0c4a113e2f18ac2c6cc7df51ad6afc24081' 2025-12-04T08:58:21.9957307Z Submodule path 'third_party/psimd': checked out '072586a71b55b7f8c584153d223e95687148a900' 2025-12-04T08:58:22.0102049Z Submodule path 'third_party/pthreadpool': checked out '4fe0e1e183925bf8cfa6aae24237e724a96479b8' 2025-12-04T08:58:22.0574052Z Submodule path 'third_party/pybind11': checked out 'f5fbe867d2d26e4a0a9177a51f6e568868ad3dc8' 2025-12-04T08:58:22.0885519Z Submodule path 'third_party/python-peachpy': checked out 'f45429b087dd7d5bc78bb40dc7cf06425c252d67' 2025-12-04T08:58:22.1333934Z Submodule path 'third_party/sleef': checked out '5a1d179df9cf652951b59010a2d2075372d67f68' 2025-12-04T08:58:22.1650108Z Submodule path 'third_party/tensorpipe': checked out '2b4cd91092d335a697416b2a3cb398283246849d' 2025-12-04T08:58:22.1669978Z Submodule 'third_party/googletest' (https://github.com/google/googletest.git) registered for path 'third_party/tensorpipe/third_party/googletest' 2025-12-04T08:58:22.1673143Z Submodule 'third_party/libnop' (https://github.com/google/libnop.git) registered for path 'third_party/tensorpipe/third_party/libnop' 2025-12-04T08:58:22.1676365Z Submodule 'third_party/libuv' (https://github.com/libuv/libuv.git) registered for path 'third_party/tensorpipe/third_party/libuv' 2025-12-04T08:58:22.1680065Z Submodule 'third_party/pybind11' (https://github.com/pybind/pybind11.git) registered for path 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T08:58:22.1709829Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/tensorpipe/third_party/googletest'... 2025-12-04T08:58:23.0996420Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/tensorpipe/third_party/libnop'... 2025-12-04T08:58:23.0997995Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/tensorpipe/third_party/pybind11'... 2025-12-04T08:58:23.1700077Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/tensorpipe/third_party/libuv'... 2025-12-04T08:58:23.2266128Z Submodule path 'third_party/tensorpipe/third_party/googletest': checked out 'aee0f9d9b5b87796ee8a0ab26b7587ec30e8858e' 2025-12-04T08:58:23.2455587Z Submodule path 'third_party/tensorpipe/third_party/libnop': checked out '910b55815be16109f04f4180e9adee14fb4ce281' 2025-12-04T08:58:23.3215186Z Submodule path 'third_party/tensorpipe/third_party/libuv': checked out '5152db2cbfeb5582e9c27c5ea1dba2cd9e10759b' 2025-12-04T08:58:23.3537860Z Submodule path 'third_party/tensorpipe/third_party/pybind11': checked out 'a23996fce38ff6ccfbcdc09f1e63f2c4be5ea2ef' 2025-12-04T08:58:23.3556708Z Submodule 'tools/clang' (https://github.com/wjakob/clang-cindex-python3) registered for path 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T08:58:23.3587888Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/tensorpipe/third_party/pybind11/tools/clang'... 2025-12-04T08:58:23.5471742Z Submodule path 'third_party/tensorpipe/third_party/pybind11/tools/clang': checked out '6a00cbc4a9b8e68b71caf7f774b3f9c753ae84d5' 2025-12-04T08:58:23.5515614Z [command]/usr/bin/git submodule foreach --recursive git config --local gc.auto 0 2025-12-04T08:58:23.5855857Z Entering 'android/libs/fbjni' 2025-12-04T08:58:23.5904183Z Entering 'third_party/FP16' 2025-12-04T08:58:23.5953155Z Entering 'third_party/FXdiv' 2025-12-04T08:58:23.6002169Z Entering 'third_party/NNPACK' 2025-12-04T08:58:23.6054379Z Entering 'third_party/NVTX' 2025-12-04T08:58:23.6103925Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T08:58:23.6155012Z Entering 'third_party/XNNPACK' 2025-12-04T08:58:23.6216864Z Entering 'third_party/aiter' 2025-12-04T08:58:23.6264155Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T08:58:23.6323630Z Entering 'third_party/benchmark' 2025-12-04T08:58:23.6373213Z Entering 'third_party/composable_kernel' 2025-12-04T08:58:23.6432769Z Entering 'third_party/cpp-httplib' 2025-12-04T08:58:23.6486078Z Entering 'third_party/cpuinfo' 2025-12-04T08:58:23.6533930Z Entering 'third_party/cudnn_frontend' 2025-12-04T08:58:23.6586358Z Entering 'third_party/cutlass' 2025-12-04T08:58:23.6642389Z Entering 'third_party/fbgemm' 2025-12-04T08:58:23.6693816Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T08:58:23.6744490Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T08:58:23.6799989Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T08:58:23.6845997Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T08:58:23.6901546Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T08:58:23.6946503Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T08:58:23.6992989Z Entering 'third_party/fbgemm/external/json' 2025-12-04T08:58:23.7048290Z Entering 'third_party/flash-attention' 2025-12-04T08:58:23.7095309Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T08:58:23.7150714Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T08:58:23.7207156Z Entering 'third_party/flatbuffers' 2025-12-04T08:58:23.7259527Z Entering 'third_party/fmt' 2025-12-04T08:58:23.7306552Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T08:58:23.7354072Z Entering 'third_party/gloo' 2025-12-04T08:58:23.7406061Z Entering 'third_party/googletest' 2025-12-04T08:58:23.7453470Z Entering 'third_party/ideep' 2025-12-04T08:58:23.7500386Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T08:58:23.7553113Z Entering 'third_party/ittapi' 2025-12-04T08:58:23.7607211Z Entering 'third_party/kineto' 2025-12-04T08:58:23.7653372Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T08:58:23.7709081Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T08:58:23.7756718Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T08:58:23.7803370Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T08:58:23.7855604Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T08:58:23.7903598Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T08:58:23.7958830Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T08:58:23.8012597Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T08:58:23.8065731Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T08:58:23.8114329Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T08:58:23.8165796Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T08:58:23.8212080Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T08:58:23.8260984Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T08:58:23.8312462Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T08:58:23.8363511Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T08:58:23.8417914Z Entering 'third_party/kleidiai' 2025-12-04T08:58:23.8466283Z Entering 'third_party/mimalloc' 2025-12-04T08:58:23.8514058Z Entering 'third_party/nlohmann' 2025-12-04T08:58:23.8568235Z Entering 'third_party/onnx' 2025-12-04T08:58:23.8629134Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T08:58:23.8681941Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T08:58:23.8731394Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T08:58:23.8777943Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T08:58:23.8822973Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T08:58:23.8868220Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T08:58:23.8915424Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T08:58:23.8962846Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T08:58:23.9012109Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T08:58:23.9062593Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T08:58:23.9114499Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T08:58:23.9168898Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T08:58:23.9233795Z Entering 'third_party/pocketfft' 2025-12-04T08:58:23.9284218Z Entering 'third_party/protobuf' 2025-12-04T08:58:23.9334494Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T08:58:23.9383273Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T08:58:23.9430926Z Entering 'third_party/psimd' 2025-12-04T08:58:23.9477660Z Entering 'third_party/pthreadpool' 2025-12-04T08:58:23.9524702Z Entering 'third_party/pybind11' 2025-12-04T08:58:23.9575032Z Entering 'third_party/python-peachpy' 2025-12-04T08:58:23.9626691Z Entering 'third_party/sleef' 2025-12-04T08:58:23.9675137Z Entering 'third_party/tensorpipe' 2025-12-04T08:58:23.9721825Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T08:58:23.9773498Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T08:58:23.9827523Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T08:58:23.9874671Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T08:58:23.9920725Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T08:58:23.9986054Z ##[endgroup] 2025-12-04T08:58:23.9986510Z ##[group]Persisting credentials for submodules 2025-12-04T08:58:23.9993428Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'url\.https\:\/\/github\.com\/\.insteadOf' && git config --local --unset-all 'url.https://github.com/.insteadOf' || :" 2025-12-04T08:58:24.0325616Z Entering 'android/libs/fbjni' 2025-12-04T08:58:24.0394322Z Entering 'third_party/FP16' 2025-12-04T08:58:24.0467556Z Entering 'third_party/FXdiv' 2025-12-04T08:58:24.0535623Z Entering 'third_party/NNPACK' 2025-12-04T08:58:24.0596363Z Entering 'third_party/NVTX' 2025-12-04T08:58:24.0658475Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T08:58:24.0722571Z Entering 'third_party/XNNPACK' 2025-12-04T08:58:24.0799442Z Entering 'third_party/aiter' 2025-12-04T08:58:24.0866063Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T08:58:24.0943757Z Entering 'third_party/benchmark' 2025-12-04T08:58:24.1006286Z Entering 'third_party/composable_kernel' 2025-12-04T08:58:24.1079903Z Entering 'third_party/cpp-httplib' 2025-12-04T08:58:24.1145810Z Entering 'third_party/cpuinfo' 2025-12-04T08:58:24.1206862Z Entering 'third_party/cudnn_frontend' 2025-12-04T08:58:24.1275331Z Entering 'third_party/cutlass' 2025-12-04T08:58:24.1353781Z Entering 'third_party/fbgemm' 2025-12-04T08:58:24.1416951Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T08:58:24.1482595Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T08:58:24.1552535Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T08:58:24.1616303Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T08:58:24.1690226Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T08:58:24.1752218Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T08:58:24.1816760Z Entering 'third_party/fbgemm/external/json' 2025-12-04T08:58:24.1882641Z Entering 'third_party/flash-attention' 2025-12-04T08:58:24.1945471Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T08:58:24.2012511Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T08:58:24.2083475Z Entering 'third_party/flatbuffers' 2025-12-04T08:58:24.2148607Z Entering 'third_party/fmt' 2025-12-04T08:58:24.2213544Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T08:58:24.2276959Z Entering 'third_party/gloo' 2025-12-04T08:58:24.2344310Z Entering 'third_party/googletest' 2025-12-04T08:58:24.2408307Z Entering 'third_party/ideep' 2025-12-04T08:58:24.2473781Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T08:58:24.2543303Z Entering 'third_party/ittapi' 2025-12-04T08:58:24.2613297Z Entering 'third_party/kineto' 2025-12-04T08:58:24.2675690Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T08:58:24.2737311Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T08:58:24.2803058Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T08:58:24.2867438Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T08:58:24.2934999Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T08:58:24.3002265Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T08:58:24.3075715Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T08:58:24.3146088Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T08:58:24.3207887Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T08:58:24.3276811Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T08:58:24.3341322Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T08:58:24.3402786Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T08:58:24.3478924Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T08:58:24.3551087Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T08:58:24.3613151Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T08:58:24.3677817Z Entering 'third_party/kleidiai' 2025-12-04T08:58:24.3747842Z Entering 'third_party/mimalloc' 2025-12-04T08:58:24.3818018Z Entering 'third_party/nlohmann' 2025-12-04T08:58:24.3883832Z Entering 'third_party/onnx' 2025-12-04T08:58:24.3959942Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T08:58:24.4031067Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T08:58:24.4094635Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T08:58:24.4163649Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T08:58:24.4225496Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T08:58:24.4285640Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T08:58:24.4356939Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T08:58:24.4422485Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T08:58:24.4483176Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T08:58:24.4543710Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T08:58:24.4608398Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T08:58:24.4679864Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T08:58:24.4760484Z Entering 'third_party/pocketfft' 2025-12-04T08:58:24.4826884Z Entering 'third_party/protobuf' 2025-12-04T08:58:24.4893746Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T08:58:24.4956026Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T08:58:24.5026056Z Entering 'third_party/psimd' 2025-12-04T08:58:24.5090071Z Entering 'third_party/pthreadpool' 2025-12-04T08:58:24.5153860Z Entering 'third_party/pybind11' 2025-12-04T08:58:24.5217384Z Entering 'third_party/python-peachpy' 2025-12-04T08:58:24.5282290Z Entering 'third_party/sleef' 2025-12-04T08:58:24.5345962Z Entering 'third_party/tensorpipe' 2025-12-04T08:58:24.5406156Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T08:58:24.5472946Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T08:58:24.5546340Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T08:58:24.5606105Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T08:58:24.5670731Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T08:58:24.5756010Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local 'http.https://github.com/.extraheader' 'AUTHORIZATION: basic ***' && git config --local --show-origin --name-only --get-regexp remote.origin.url" 2025-12-04T08:58:24.6089708Z Entering 'android/libs/fbjni' 2025-12-04T08:58:24.6154312Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/android/libs/fbjni/config remote.origin.url 2025-12-04T08:58:24.6176354Z Entering 'third_party/FP16' 2025-12-04T08:58:24.6234067Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FP16/config remote.origin.url 2025-12-04T08:58:24.6253952Z Entering 'third_party/FXdiv' 2025-12-04T08:58:24.6315881Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FXdiv/config remote.origin.url 2025-12-04T08:58:24.6335293Z Entering 'third_party/NNPACK' 2025-12-04T08:58:24.6398021Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK/config remote.origin.url 2025-12-04T08:58:24.6417479Z Entering 'third_party/NVTX' 2025-12-04T08:58:24.6477223Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NVTX/config remote.origin.url 2025-12-04T08:58:24.6497061Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T08:58:24.6555741Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/VulkanMemoryAllocator/config remote.origin.url 2025-12-04T08:58:24.6575623Z Entering 'third_party/XNNPACK' 2025-12-04T08:58:24.6633147Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/XNNPACK/config remote.origin.url 2025-12-04T08:58:24.6666944Z Entering 'third_party/aiter' 2025-12-04T08:58:24.6725241Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/aiter/config remote.origin.url 2025-12-04T08:58:24.6744714Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T08:58:24.6802969Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/aiter/modules/3rdparty/composable_kernel/config remote.origin.url 2025-12-04T08:58:24.6831516Z Entering 'third_party/benchmark' 2025-12-04T08:58:24.6887711Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/benchmark/config remote.origin.url 2025-12-04T08:58:24.6908360Z Entering 'third_party/composable_kernel' 2025-12-04T08:58:24.6965273Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/composable_kernel/config remote.origin.url 2025-12-04T08:58:24.6992293Z Entering 'third_party/cpp-httplib' 2025-12-04T08:58:24.7055621Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/cpp-httplib/config remote.origin.url 2025-12-04T08:58:24.7075378Z Entering 'third_party/cpuinfo' 2025-12-04T08:58:24.7132916Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/cpuinfo/config remote.origin.url 2025-12-04T08:58:24.7153604Z Entering 'third_party/cudnn_frontend' 2025-12-04T08:58:24.7212626Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/cudnn_frontend/config remote.origin.url 2025-12-04T08:58:24.7232747Z Entering 'third_party/cutlass' 2025-12-04T08:58:24.7295449Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/cutlass/config remote.origin.url 2025-12-04T08:58:24.7323189Z Entering 'third_party/fbgemm' 2025-12-04T08:58:24.7383873Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/config remote.origin.url 2025-12-04T08:58:24.7404983Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T08:58:24.7462876Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/asmjit/config remote.origin.url 2025-12-04T08:58:24.7482687Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T08:58:24.7542787Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/composable_kernel/config remote.origin.url 2025-12-04T08:58:24.7570552Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T08:58:24.7626649Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/cpuinfo/config remote.origin.url 2025-12-04T08:58:24.7646061Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T08:58:24.7704518Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/cutlass/config remote.origin.url 2025-12-04T08:58:24.7732132Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T08:58:24.7788390Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/googletest/config remote.origin.url 2025-12-04T08:58:24.7807181Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T08:58:24.7864816Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/hipify_torch/config remote.origin.url 2025-12-04T08:58:24.7883779Z Entering 'third_party/fbgemm/external/json' 2025-12-04T08:58:24.7941426Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/json/config remote.origin.url 2025-12-04T08:58:24.7963412Z Entering 'third_party/flash-attention' 2025-12-04T08:58:24.8022538Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/config remote.origin.url 2025-12-04T08:58:24.8042282Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T08:58:24.8101467Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/modules/csrc/composable_kernel/config remote.origin.url 2025-12-04T08:58:24.8126833Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T08:58:24.8194740Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/modules/csrc/cutlass/config remote.origin.url 2025-12-04T08:58:24.8223434Z Entering 'third_party/flatbuffers' 2025-12-04T08:58:24.8281949Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/flatbuffers/config remote.origin.url 2025-12-04T08:58:24.8305018Z Entering 'third_party/fmt' 2025-12-04T08:58:24.8363679Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fmt/config remote.origin.url 2025-12-04T08:58:24.8384026Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T08:58:24.8442787Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/gemmlowp/gemmlowp/config remote.origin.url 2025-12-04T08:58:24.8463453Z Entering 'third_party/gloo' 2025-12-04T08:58:24.8521771Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/gloo/config remote.origin.url 2025-12-04T08:58:24.8542442Z Entering 'third_party/googletest' 2025-12-04T08:58:24.8597953Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/googletest/config remote.origin.url 2025-12-04T08:58:24.8617675Z Entering 'third_party/ideep' 2025-12-04T08:58:24.8676669Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/config remote.origin.url 2025-12-04T08:58:24.8694490Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T08:58:24.8752311Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/modules/mkl-dnn/config remote.origin.url 2025-12-04T08:58:24.8780383Z Entering 'third_party/ittapi' 2025-12-04T08:58:24.8836929Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/ittapi/config remote.origin.url 2025-12-04T08:58:24.8856344Z Entering 'third_party/kineto' 2025-12-04T08:58:24.8915545Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/config remote.origin.url 2025-12-04T08:58:24.8934343Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T08:58:24.8993608Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/config remote.origin.url 2025-12-04T08:58:24.9012564Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T08:58:24.9072698Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/DCGM/config remote.origin.url 2025-12-04T08:58:24.9094174Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T08:58:24.9153293Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/cpr/config remote.origin.url 2025-12-04T08:58:24.9173591Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T08:58:24.9233026Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/fmt/config remote.origin.url 2025-12-04T08:58:24.9253270Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T08:58:24.9313115Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/config remote.origin.url 2025-12-04T08:58:24.9331719Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T08:58:24.9391515Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/modules/doc/config remote.origin.url 2025-12-04T08:58:24.9414432Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T08:58:24.9473397Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/glog/config remote.origin.url 2025-12-04T08:58:24.9493452Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T08:58:24.9552697Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/googletest/config remote.origin.url 2025-12-04T08:58:24.9573090Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T08:58:24.9633165Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/json/config remote.origin.url 2025-12-04T08:58:24.9654336Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T08:58:24.9713511Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/pfs/config remote.origin.url 2025-12-04T08:58:24.9733848Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T08:58:24.9794305Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/config remote.origin.url 2025-12-04T08:58:24.9812962Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T08:58:24.9877746Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/modules/civetweb/config remote.origin.url 2025-12-04T08:58:24.9899053Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T08:58:24.9958105Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/modules/googletest/config remote.origin.url 2025-12-04T08:58:24.9982555Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T08:58:25.0041617Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/fmt/config remote.origin.url 2025-12-04T08:58:25.0062055Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T08:58:25.0118400Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/googletest/config remote.origin.url 2025-12-04T08:58:25.0139756Z Entering 'third_party/kleidiai' 2025-12-04T08:58:25.0196505Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kleidiai/config remote.origin.url 2025-12-04T08:58:25.0216872Z Entering 'third_party/mimalloc' 2025-12-04T08:58:25.0275999Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/mimalloc/config remote.origin.url 2025-12-04T08:58:25.0296127Z Entering 'third_party/nlohmann' 2025-12-04T08:58:25.0353372Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/nlohmann/config remote.origin.url 2025-12-04T08:58:25.0375153Z Entering 'third_party/onnx' 2025-12-04T08:58:25.0436948Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/config remote.origin.url 2025-12-04T08:58:25.0470732Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T08:58:25.0531092Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/modules/third_party/pybind11/config remote.origin.url 2025-12-04T08:58:25.0553279Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T08:58:25.0611198Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/config remote.origin.url 2025-12-04T08:58:25.0631686Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T08:58:25.0688184Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/benchmark/config remote.origin.url 2025-12-04T08:58:25.0707250Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T08:58:25.0763286Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/googletest/config remote.origin.url 2025-12-04T08:58:25.0782931Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T08:58:25.0841779Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/ms-gsl/config remote.origin.url 2025-12-04T08:58:25.0859711Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T08:58:25.0916464Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/nlohmann-json/config remote.origin.url 2025-12-04T08:58:25.0937100Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T08:58:25.0996420Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/opentelemetry-proto/config remote.origin.url 2025-12-04T08:58:25.1015289Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T08:58:25.1077617Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/opentracing-cpp/config remote.origin.url 2025-12-04T08:58:25.1096345Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T08:58:25.1158108Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/config remote.origin.url 2025-12-04T08:58:25.1175830Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T08:58:25.1234153Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/modules/civetweb/config remote.origin.url 2025-12-04T08:58:25.1255530Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T08:58:25.1317396Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/modules/googletest/config remote.origin.url 2025-12-04T08:58:25.1338942Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T08:58:25.1398399Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/tools/vcpkg/config remote.origin.url 2025-12-04T08:58:25.1435655Z Entering 'third_party/pocketfft' 2025-12-04T08:58:25.1500619Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/pocketfft/config remote.origin.url 2025-12-04T08:58:25.1519239Z Entering 'third_party/protobuf' 2025-12-04T08:58:25.1577465Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/config remote.origin.url 2025-12-04T08:58:25.1598362Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T08:58:25.1654462Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/benchmark/config remote.origin.url 2025-12-04T08:58:25.1673767Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T08:58:25.1732466Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/googletest/config remote.origin.url 2025-12-04T08:58:25.1754827Z Entering 'third_party/psimd' 2025-12-04T08:58:25.1821171Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/psimd/config remote.origin.url 2025-12-04T08:58:25.1841567Z Entering 'third_party/pthreadpool' 2025-12-04T08:58:25.1901606Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/pthreadpool/config remote.origin.url 2025-12-04T08:58:25.1921969Z Entering 'third_party/pybind11' 2025-12-04T08:58:25.1980080Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/pybind11/config remote.origin.url 2025-12-04T08:58:25.2004406Z Entering 'third_party/python-peachpy' 2025-12-04T08:58:25.2066930Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/python-peachpy/config remote.origin.url 2025-12-04T08:58:25.2086219Z Entering 'third_party/sleef' 2025-12-04T08:58:25.2149649Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/sleef/config remote.origin.url 2025-12-04T08:58:25.2169607Z Entering 'third_party/tensorpipe' 2025-12-04T08:58:25.2226970Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/config remote.origin.url 2025-12-04T08:58:25.2245047Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T08:58:25.2302888Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/googletest/config remote.origin.url 2025-12-04T08:58:25.2322533Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T08:58:25.2380758Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libnop/config remote.origin.url 2025-12-04T08:58:25.2400811Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T08:58:25.2459766Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libuv/config remote.origin.url 2025-12-04T08:58:25.2478799Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T08:58:25.2536193Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/config remote.origin.url 2025-12-04T08:58:25.2553620Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T08:58:25.2612790Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/modules/tools/clang/config remote.origin.url 2025-12-04T08:58:25.3457265Z [command]/usr/bin/git submodule foreach --recursive git config --local --add 'url.https://github.com/.insteadOf' 'git@github.com:' 2025-12-04T08:58:25.3790865Z Entering 'android/libs/fbjni' 2025-12-04T08:58:25.3837614Z Entering 'third_party/FP16' 2025-12-04T08:58:25.3885780Z Entering 'third_party/FXdiv' 2025-12-04T08:58:25.3933655Z Entering 'third_party/NNPACK' 2025-12-04T08:58:25.3980920Z Entering 'third_party/NVTX' 2025-12-04T08:58:25.4027949Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T08:58:25.4075051Z Entering 'third_party/XNNPACK' 2025-12-04T08:58:25.4136020Z Entering 'third_party/aiter' 2025-12-04T08:58:25.4184534Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T08:58:25.4245086Z Entering 'third_party/benchmark' 2025-12-04T08:58:25.4296777Z Entering 'third_party/composable_kernel' 2025-12-04T08:58:25.4351503Z Entering 'third_party/cpp-httplib' 2025-12-04T08:58:25.4397238Z Entering 'third_party/cpuinfo' 2025-12-04T08:58:25.4443396Z Entering 'third_party/cudnn_frontend' 2025-12-04T08:58:25.4490927Z Entering 'third_party/cutlass' 2025-12-04T08:58:25.4544721Z Entering 'third_party/fbgemm' 2025-12-04T08:58:25.4594671Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T08:58:25.4642199Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T08:58:25.4695912Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T08:58:25.4743624Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T08:58:25.4795567Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T08:58:25.4846255Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T08:58:25.4894302Z Entering 'third_party/fbgemm/external/json' 2025-12-04T08:58:25.4949891Z Entering 'third_party/flash-attention' 2025-12-04T08:58:25.4996846Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T08:58:25.5049514Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T08:58:25.5104330Z Entering 'third_party/flatbuffers' 2025-12-04T08:58:25.5154512Z Entering 'third_party/fmt' 2025-12-04T08:58:25.5203450Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T08:58:25.5256919Z Entering 'third_party/gloo' 2025-12-04T08:58:25.5305246Z Entering 'third_party/googletest' 2025-12-04T08:58:25.5353675Z Entering 'third_party/ideep' 2025-12-04T08:58:25.5412826Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T08:58:25.5471960Z Entering 'third_party/ittapi' 2025-12-04T08:58:25.5525215Z Entering 'third_party/kineto' 2025-12-04T08:58:25.5577373Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T08:58:25.5623724Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T08:58:25.5678240Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T08:58:25.5725722Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T08:58:25.5777884Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T08:58:25.5823945Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T08:58:25.5880182Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T08:58:25.5927377Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T08:58:25.5974745Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T08:58:25.6028648Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T08:58:25.6076293Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T08:58:25.6123083Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T08:58:25.6171326Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T08:58:25.6223147Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T08:58:25.6267881Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T08:58:25.6318056Z Entering 'third_party/kleidiai' 2025-12-04T08:58:25.6370569Z Entering 'third_party/mimalloc' 2025-12-04T08:58:25.6417000Z Entering 'third_party/nlohmann' 2025-12-04T08:58:25.6466562Z Entering 'third_party/onnx' 2025-12-04T08:58:25.6527937Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T08:58:25.6579012Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T08:58:25.6626052Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T08:58:25.6679351Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T08:58:25.6725379Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T08:58:25.6776486Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T08:58:25.6824720Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T08:58:25.6872833Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T08:58:25.6919215Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T08:58:25.6964573Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T08:58:25.7018689Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T08:58:25.7068609Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T08:58:25.7133425Z Entering 'third_party/pocketfft' 2025-12-04T08:58:25.7182683Z Entering 'third_party/protobuf' 2025-12-04T08:58:25.7233751Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T08:58:25.7280389Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T08:58:25.7329239Z Entering 'third_party/psimd' 2025-12-04T08:58:25.7375768Z Entering 'third_party/pthreadpool' 2025-12-04T08:58:25.7423842Z Entering 'third_party/pybind11' 2025-12-04T08:58:25.7473721Z Entering 'third_party/python-peachpy' 2025-12-04T08:58:25.7522501Z Entering 'third_party/sleef' 2025-12-04T08:58:25.7576840Z Entering 'third_party/tensorpipe' 2025-12-04T08:58:25.7624530Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T08:58:25.7674693Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T08:58:25.7722433Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T08:58:25.7769756Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T08:58:25.7813282Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T08:58:25.7888630Z [command]/usr/bin/git submodule foreach --recursive git config --local --add 'url.https://github.com/.insteadOf' 'org-21003710@github.com:' 2025-12-04T08:58:25.8228505Z Entering 'android/libs/fbjni' 2025-12-04T08:58:25.8276012Z Entering 'third_party/FP16' 2025-12-04T08:58:25.8323938Z Entering 'third_party/FXdiv' 2025-12-04T08:58:25.8377549Z Entering 'third_party/NNPACK' 2025-12-04T08:58:25.8425167Z Entering 'third_party/NVTX' 2025-12-04T08:58:25.8475438Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T08:58:25.8523413Z Entering 'third_party/XNNPACK' 2025-12-04T08:58:25.8590453Z Entering 'third_party/aiter' 2025-12-04T08:58:25.8637121Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T08:58:25.8693288Z Entering 'third_party/benchmark' 2025-12-04T08:58:25.8746746Z Entering 'third_party/composable_kernel' 2025-12-04T08:58:25.8802787Z Entering 'third_party/cpp-httplib' 2025-12-04T08:58:25.8855397Z Entering 'third_party/cpuinfo' 2025-12-04T08:58:25.8909199Z Entering 'third_party/cudnn_frontend' 2025-12-04T08:58:25.8956424Z Entering 'third_party/cutlass' 2025-12-04T08:58:25.9011383Z Entering 'third_party/fbgemm' 2025-12-04T08:58:25.9059314Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T08:58:25.9105937Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T08:58:25.9160670Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T08:58:25.9207713Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T08:58:25.9262641Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T08:58:25.9307900Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T08:58:25.9354367Z Entering 'third_party/fbgemm/external/json' 2025-12-04T08:58:25.9410038Z Entering 'third_party/flash-attention' 2025-12-04T08:58:25.9464400Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T08:58:25.9522291Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T08:58:25.9584309Z Entering 'third_party/flatbuffers' 2025-12-04T08:58:25.9636113Z Entering 'third_party/fmt' 2025-12-04T08:58:25.9685918Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T08:58:25.9734844Z Entering 'third_party/gloo' 2025-12-04T08:58:25.9786063Z Entering 'third_party/googletest' 2025-12-04T08:58:25.9834601Z Entering 'third_party/ideep' 2025-12-04T08:58:25.9885041Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T08:58:25.9944725Z Entering 'third_party/ittapi' 2025-12-04T08:58:25.9993739Z Entering 'third_party/kineto' 2025-12-04T08:58:26.0041831Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T08:58:26.0086742Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T08:58:26.0135593Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T08:58:26.0187756Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T08:58:26.0234891Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T08:58:26.0285687Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T08:58:26.0339981Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T08:58:26.0385976Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T08:58:26.0433901Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T08:58:26.0483810Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T08:58:26.0537054Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T08:58:26.0584488Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T08:58:26.0635663Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T08:58:26.0695022Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T08:58:26.0745926Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T08:58:26.0799593Z Entering 'third_party/kleidiai' 2025-12-04T08:58:26.0847416Z Entering 'third_party/mimalloc' 2025-12-04T08:58:26.0895948Z Entering 'third_party/nlohmann' 2025-12-04T08:58:26.0945904Z Entering 'third_party/onnx' 2025-12-04T08:58:26.1008019Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T08:58:26.1062702Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T08:58:26.1113670Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T08:58:26.1164798Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T08:58:26.1216996Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T08:58:26.1264388Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T08:58:26.1317678Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T08:58:26.1364416Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T08:58:26.1416701Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T08:58:26.1462330Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T08:58:26.1509372Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T08:58:26.1558775Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T08:58:26.1623231Z Entering 'third_party/pocketfft' 2025-12-04T08:58:26.1675632Z Entering 'third_party/protobuf' 2025-12-04T08:58:26.1729212Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T08:58:26.1775907Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T08:58:26.1826237Z Entering 'third_party/psimd' 2025-12-04T08:58:26.1874820Z Entering 'third_party/pthreadpool' 2025-12-04T08:58:26.1926655Z Entering 'third_party/pybind11' 2025-12-04T08:58:26.1974944Z Entering 'third_party/python-peachpy' 2025-12-04T08:58:26.2026018Z Entering 'third_party/sleef' 2025-12-04T08:58:26.2075024Z Entering 'third_party/tensorpipe' 2025-12-04T08:58:26.2125078Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T08:58:26.2177982Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T08:58:26.2224292Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T08:58:26.2275145Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T08:58:26.2324215Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T08:58:26.2400804Z ##[endgroup] 2025-12-04T08:58:26.2437341Z [command]/usr/bin/git log -1 --format=%H 2025-12-04T08:58:26.2460254Z ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T08:58:26.2574370Z ##[group]Run cd "${GITHUB_WORKSPACE}" 2025-12-04T08:58:26.2574640Z cd "${GITHUB_WORKSPACE}" 2025-12-04T08:58:26.2574853Z # Clean stale submodule dirs 2025-12-04T08:58:26.2575071Z if [ -z "${NO_SUDO}" ]; then 2025-12-04T08:58:26.2575336Z  sudo git submodule foreach --recursive git clean -ffdx 2025-12-04T08:58:26.2575751Z else 2025-12-04T08:58:26.2576098Z  git submodule foreach --recursive git clean -ffdx 2025-12-04T08:58:26.2576354Z fi 2025-12-04T08:58:26.2587391Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T08:58:26.2587675Z env: 2025-12-04T08:58:26.2587837Z GIT_DEFAULT_BRANCH: main 2025-12-04T08:58:26.2588020Z NO_SUDO: true 2025-12-04T08:58:26.2588183Z ##[endgroup] 2025-12-04T08:58:26.2945211Z Entering 'android/libs/fbjni' 2025-12-04T08:58:26.2981833Z Entering 'third_party/FP16' 2025-12-04T08:58:26.3026758Z Entering 'third_party/FXdiv' 2025-12-04T08:58:26.3067864Z Entering 'third_party/NNPACK' 2025-12-04T08:58:26.3106706Z Entering 'third_party/NVTX' 2025-12-04T08:58:26.3155068Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T08:58:26.3196335Z Entering 'third_party/XNNPACK' 2025-12-04T08:58:26.3321214Z Entering 'third_party/aiter' 2025-12-04T08:58:26.3366694Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T08:58:26.3482498Z Entering 'third_party/benchmark' 2025-12-04T08:58:26.3526012Z Entering 'third_party/composable_kernel' 2025-12-04T08:58:26.3649449Z Entering 'third_party/cpp-httplib' 2025-12-04T08:58:26.3686299Z Entering 'third_party/cpuinfo' 2025-12-04T08:58:26.3726842Z Entering 'third_party/cudnn_frontend' 2025-12-04T08:58:26.3765976Z Entering 'third_party/cutlass' 2025-12-04T08:58:26.3875272Z Entering 'third_party/fbgemm' 2025-12-04T08:58:26.3939374Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T08:58:26.3975606Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T08:58:26.4103494Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T08:58:26.4143854Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T08:58:26.4251820Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T08:58:26.4296189Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T08:58:26.4334495Z Entering 'third_party/fbgemm/external/json' 2025-12-04T08:58:26.4389182Z Entering 'third_party/flash-attention' 2025-12-04T08:58:26.4432761Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T08:58:26.4536711Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T08:58:26.4630594Z Entering 'third_party/flatbuffers' 2025-12-04T08:58:26.4708065Z Entering 'third_party/fmt' 2025-12-04T08:58:26.4746958Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T08:58:26.4785395Z Entering 'third_party/gloo' 2025-12-04T08:58:26.4824013Z Entering 'third_party/googletest' 2025-12-04T08:58:26.4864278Z Entering 'third_party/ideep' 2025-12-04T08:58:26.4902278Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T08:58:26.4988673Z Entering 'third_party/ittapi' 2025-12-04T08:58:26.5025995Z Entering 'third_party/kineto' 2025-12-04T08:58:26.5064741Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T08:58:26.5105532Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T08:58:26.5155430Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T08:58:26.5196442Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T08:58:26.5234008Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T08:58:26.5268165Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T08:58:26.5309774Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T08:58:26.5345839Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T08:58:26.5384945Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T08:58:26.5435741Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T08:58:26.5475225Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T08:58:26.5514677Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T08:58:26.5570109Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T08:58:26.5613142Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T08:58:26.5653519Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T08:58:26.5695287Z Entering 'third_party/kleidiai' 2025-12-04T08:58:26.5738848Z Entering 'third_party/mimalloc' 2025-12-04T08:58:26.5775805Z Entering 'third_party/nlohmann' 2025-12-04T08:58:26.5825332Z Entering 'third_party/onnx' 2025-12-04T08:58:26.6171430Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T08:58:26.6211845Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T08:58:26.6272063Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T08:58:26.6306379Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T08:58:26.6344430Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T08:58:26.6377595Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T08:58:26.6424072Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T08:58:26.6463157Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T08:58:26.6497268Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T08:58:26.6532810Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T08:58:26.6586170Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T08:58:26.6626881Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T08:58:26.6891454Z Entering 'third_party/pocketfft' 2025-12-04T08:58:26.6936551Z Entering 'third_party/protobuf' 2025-12-04T08:58:26.7017908Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T08:58:26.7053726Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T08:58:26.7096798Z Entering 'third_party/psimd' 2025-12-04T08:58:26.7136805Z Entering 'third_party/pthreadpool' 2025-12-04T08:58:26.7174005Z Entering 'third_party/pybind11' 2025-12-04T08:58:26.7213949Z Entering 'third_party/python-peachpy' 2025-12-04T08:58:26.7256162Z Entering 'third_party/sleef' 2025-12-04T08:58:26.7295600Z Entering 'third_party/tensorpipe' 2025-12-04T08:58:26.7334226Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T08:58:26.7377082Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T08:58:26.7412459Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T08:58:26.7457163Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T08:58:26.7490694Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T08:58:26.7648223Z Prepare all required actions 2025-12-04T08:58:26.7648831Z Getting action download info 2025-12-04T08:58:26.9328968Z ##[group]Run ./.github/actions/setup-linux 2025-12-04T08:58:26.9329200Z env: 2025-12-04T08:58:26.9329357Z GIT_DEFAULT_BRANCH: main 2025-12-04T08:58:26.9329561Z ##[endgroup] 2025-12-04T08:58:26.9361033Z ##[group]Run set -euo pipefail 2025-12-04T08:58:26.9361292Z set -euo pipefail 2025-12-04T08:58:26.9361505Z function get_ec2_metadata() { 2025-12-04T08:58:26.9362006Z  # Pulled from instance metadata endpoint for EC2 2025-12-04T08:58:26.9362536Z  # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html 2025-12-04T08:58:26.9362938Z  category=$1 2025-12-04T08:58:26.9363199Z  # If it is GCP runner (runner name contains gcp), do not run this 2025-12-04T08:58:26.9363497Z  runner_name_str=i-004b897e3464f6ebc 2025-12-04T08:58:26.9363768Z  if [[ -f /.inarc ]]; then 2025-12-04T08:58:26.9364009Z  echo "ARC Runner, no info on ec2 metadata" 2025-12-04T08:58:26.9364278Z  elif [[ $runner_name_str == *"gcp"* ]]; then 2025-12-04T08:58:26.9364600Z  echo "Runner is from Google Cloud Platform, No info on ec2 metadata" 2025-12-04T08:58:26.9364894Z  else 2025-12-04T08:58:26.9365699Z  curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}" 2025-12-04T08:58:26.9366332Z  fi 2025-12-04T08:58:26.9366481Z } 2025-12-04T08:58:26.9366668Z echo "ami-id: $(get_ec2_metadata ami-id)" 2025-12-04T08:58:26.9366958Z echo "instance-id: $(get_ec2_metadata instance-id)" 2025-12-04T08:58:26.9367289Z echo "instance-type: $(get_ec2_metadata instance-type)" 2025-12-04T08:58:26.9367581Z echo "system info $(uname -a)" 2025-12-04T08:58:26.9376558Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T08:58:26.9376843Z env: 2025-12-04T08:58:26.9377000Z GIT_DEFAULT_BRANCH: main 2025-12-04T08:58:26.9377183Z ##[endgroup] 2025-12-04T08:58:26.9521811Z ami-id: ami-08982f1c5bf93d976 2025-12-04T08:58:26.9624880Z instance-id: i-004b897e3464f6ebc 2025-12-04T08:58:26.9730015Z instance-type: g6.4xlarge 2025-12-04T08:58:26.9742551Z system info Linux ip-10-0-34-254.ec2.internal 6.1.150-174.273.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Sep 9 12:21:26 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux 2025-12-04T08:58:26.9762873Z ##[group]Run if [ -f /usr/bin/nvidia-smi ]; then nvidia-smi; fi 2025-12-04T08:58:26.9763213Z if [ -f /usr/bin/nvidia-smi ]; then nvidia-smi; fi 2025-12-04T08:58:26.9770774Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T08:58:26.9771051Z env: 2025-12-04T08:58:26.9771218Z GIT_DEFAULT_BRANCH: main 2025-12-04T08:58:26.9771407Z ##[endgroup] 2025-12-04T08:58:28.3989481Z Thu Dec 4 08:58:28 2025 2025-12-04T08:58:28.3990101Z +-----------------------------------------------------------------------------------------+ 2025-12-04T08:58:28.3990825Z | NVIDIA-SMI 580.82.07 Driver Version: 580.82.07 CUDA Version: 13.0 | 2025-12-04T08:58:28.3991491Z +-----------------------------------------+------------------------+----------------------+ 2025-12-04T08:58:28.3992188Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2025-12-04T08:58:28.3992917Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2025-12-04T08:58:28.3993503Z | | | MIG M. | 2025-12-04T08:58:28.3993944Z |=========================================+========================+======================| 2025-12-04T08:58:28.4057069Z | 0 NVIDIA L4 Off | 00000000:35:00.0 Off | 0 | 2025-12-04T08:58:28.4057867Z | N/A 40C P0 30W / 72W | 0MiB / 23034MiB | 4% Default | 2025-12-04T08:58:28.4058269Z | | | N/A | 2025-12-04T08:58:28.4058771Z +-----------------------------------------+------------------------+----------------------+ 2025-12-04T08:58:28.4059093Z 2025-12-04T08:58:28.4059267Z +-----------------------------------------------------------------------------------------+ 2025-12-04T08:58:28.4059680Z | Processes: | 2025-12-04T08:58:28.4060112Z | GPU GI CI PID Type Process name GPU Memory | 2025-12-04T08:58:28.4060516Z | ID ID Usage | 2025-12-04T08:58:28.4060844Z |=========================================================================================| 2025-12-04T08:58:28.4061641Z | No running processes found | 2025-12-04T08:58:28.4062365Z +-----------------------------------------------------------------------------------------+ 2025-12-04T08:58:28.7279860Z ##[group]Run echo "IN_CONTAINER_RUNNER=$(if [ -f /.inarc ] || [ -f /.incontainer ]; then echo true ; else echo false; fi)" >> "$GITHUB_OUTPUT" 2025-12-04T08:58:28.7280699Z echo "IN_CONTAINER_RUNNER=$(if [ -f /.inarc ] || [ -f /.incontainer ]; then echo true ; else echo false; fi)" >> "$GITHUB_OUTPUT" 2025-12-04T08:58:28.7289657Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T08:58:28.7289938Z env: 2025-12-04T08:58:28.7290099Z GIT_DEFAULT_BRANCH: main 2025-12-04T08:58:28.7290295Z ##[endgroup] 2025-12-04T08:58:28.7352660Z ##[group]Run if systemctl is-active --quiet docker; then 2025-12-04T08:58:28.7353204Z if systemctl is-active --quiet docker; then 2025-12-04T08:58:28.7353684Z  echo "Docker daemon is running..."; 2025-12-04T08:58:28.7354103Z else 2025-12-04T08:58:28.7354545Z  echo "Starting docker daemon..." && sudo systemctl start docker; 2025-12-04T08:58:28.7355093Z fi 2025-12-04T08:58:28.7365303Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T08:58:28.7365786Z env: 2025-12-04T08:58:28.7366044Z GIT_DEFAULT_BRANCH: main 2025-12-04T08:58:28.7366382Z ##[endgroup] 2025-12-04T08:58:28.7460166Z Docker daemon is running... 2025-12-04T08:58:28.7494974Z ##[group]Run nick-fields/retry@v3.0.0 2025-12-04T08:58:28.7495189Z with: 2025-12-04T08:58:28.7495336Z shell: bash 2025-12-04T08:58:28.7495497Z timeout_minutes: 5 2025-12-04T08:58:28.7495665Z max_attempts: 3 2025-12-04T08:58:28.7495826Z retry_wait_seconds: 30 2025-12-04T08:58:28.7497441Z command: AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" # For LF Runners we need to make sure we also login to Meta's ECR docker registry too. META_AWS_ACCOUNT_ID=308535385114 if [ "$AWS_ACCOUNT_ID" != "$META_AWS_ACCOUNT_ID" ] ; then aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ --password-stdin "$META_AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" fi 2025-12-04T08:58:28.7499039Z polling_interval_seconds: 1 2025-12-04T08:58:28.7499243Z warning_on_retry: true 2025-12-04T08:58:28.7499426Z continue_on_error: false 2025-12-04T08:58:28.7499599Z env: 2025-12-04T08:58:28.7499747Z GIT_DEFAULT_BRANCH: main 2025-12-04T08:58:28.7499929Z AWS_RETRY_MODE: standard 2025-12-04T08:58:28.7500099Z AWS_MAX_ATTEMPTS: 5 2025-12-04T08:58:28.7500285Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T08:58:28.7500482Z ##[endgroup] 2025-12-04T08:58:29.7646663Z WARNING! Your password will be stored unencrypted in /home/ec2-user/.docker/config.json. 2025-12-04T08:58:29.7647591Z Configure a credential helper to remove this warning. See 2025-12-04T08:58:29.7648219Z https://docs.docker.com/engine/reference/commandline/login/#credentials-store 2025-12-04T08:58:29.7648504Z 2025-12-04T08:58:29.7648592Z Login Succeeded 2025-12-04T08:58:30.2282097Z WARNING! Your password will be stored unencrypted in /home/ec2-user/.docker/config.json. 2025-12-04T08:58:30.2282778Z Configure a credential helper to remove this warning. See 2025-12-04T08:58:30.2283574Z https://docs.docker.com/engine/reference/commandline/login/#credentials-store 2025-12-04T08:58:30.2283926Z 2025-12-04T08:58:30.2284031Z Login Succeeded 2025-12-04T08:58:30.8262329Z Command completed after 1 attempt(s). 2025-12-04T08:58:30.8321985Z ##[group]Run env | grep '^GITHUB' >> "/tmp/github_env_${GITHUB_RUN_ID}" 2025-12-04T08:58:30.8322360Z env | grep '^GITHUB' >> "/tmp/github_env_${GITHUB_RUN_ID}" 2025-12-04T08:58:30.8322693Z env | grep '^CI' >> "/tmp/github_env_${GITHUB_RUN_ID}" 2025-12-04T08:58:30.8331595Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T08:58:30.8331879Z env: 2025-12-04T08:58:30.8332040Z GIT_DEFAULT_BRANCH: main 2025-12-04T08:58:30.8332234Z ##[endgroup] 2025-12-04T08:58:30.8429942Z ##[group]Run # ignore expansion of "docker ps -q" since it could be empty 2025-12-04T08:58:30.8430356Z # ignore expansion of "docker ps -q" since it could be empty 2025-12-04T08:58:30.8430885Z # shellcheck disable=SC2046 2025-12-04T08:58:30.8431123Z docker stop $(docker ps -q) || true 2025-12-04T08:58:30.8431373Z # Prune all of the docker images 2025-12-04T08:58:30.8431608Z docker system prune -af 2025-12-04T08:58:30.8438629Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T08:58:30.8438903Z env: 2025-12-04T08:58:30.8439076Z GIT_DEFAULT_BRANCH: main 2025-12-04T08:58:30.8439261Z ##[endgroup] 2025-12-04T08:58:30.8683180Z "docker stop" requires at least 1 argument. 2025-12-04T08:58:30.8683581Z See 'docker stop --help'. 2025-12-04T08:58:30.8683747Z 2025-12-04T08:58:30.8683924Z Usage: docker stop [OPTIONS] CONTAINER [CONTAINER...] 2025-12-04T08:58:30.8684182Z 2025-12-04T08:58:30.8684286Z Stop one or more running containers 2025-12-04T08:58:30.8942509Z Total reclaimed space: 0B 2025-12-04T08:58:30.9088351Z ##[group]Run pytorch/test-infra/.github/actions/calculate-docker-image@main 2025-12-04T08:58:30.9088746Z with: 2025-12-04T08:58:30.9089337Z docker-image-name: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T08:58:30.9089976Z use-custom-docker-registry: true 2025-12-04T08:58:30.9090206Z docker-build-dir: .ci/docker 2025-12-04T08:58:30.9090418Z docker-build-script: ./build.sh 2025-12-04T08:58:30.9090624Z working-directory: . 2025-12-04T08:58:30.9090866Z docker-registry: 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T08:58:30.9091151Z force-push: false 2025-12-04T08:58:30.9091320Z env: 2025-12-04T08:58:30.9091470Z GIT_DEFAULT_BRANCH: main 2025-12-04T08:58:30.9091786Z ##[endgroup] 2025-12-04T08:58:30.9107985Z ##[group]Run set -ex 2025-12-04T08:58:30.9108198Z set -ex 2025-12-04T08:58:30.9108357Z  2025-12-04T08:58:30.9108663Z # If the docker build directory or the build script doesn't exist, the action will 2025-12-04T08:58:30.9109138Z # gracefully return the docker image name as it is. Pulling docker image in Linux 2025-12-04T08:58:30.9109539Z # job could then download the pre-built image as usual 2025-12-04T08:58:30.9110020Z if [[ -d "${DOCKER_BUILD_DIR}" ]] && [[ -f "${DOCKER_BUILD_DIR}/${DOCKER_BUILD_SCRIPT}" ]] && [[ "${USE_CUSTOM_DOCKER_REGISTRY}" == "true" ]]; then 2025-12-04T08:58:30.9110454Z  echo "skip=false" >> "${GITHUB_OUTPUT}" 2025-12-04T08:58:30.9110692Z else 2025-12-04T08:58:30.9110879Z  echo "skip=true" >> "${GITHUB_OUTPUT}" 2025-12-04T08:58:30.9111193Z  echo "docker-image=${DOCKER_IMAGE_NAME}" >> "${GITHUB_OUTPUT}" 2025-12-04T08:58:30.9111474Z  2025-12-04T08:58:30.9111861Z  echo "Not using custom ECR registry. Either it was not requested or there is no Docker build script in the ${REPO_NAME} repo..." 2025-12-04T08:58:30.9112297Z  exit 0 2025-12-04T08:58:30.9112446Z fi 2025-12-04T08:58:30.9112595Z  2025-12-04T08:58:30.9112835Z if [[ "${DOCKER_IMAGE_NAME}" == *"${DOCKER_REGISTRY}/${REPO_NAME}"* ]]; then 2025-12-04T08:58:30.9113253Z  # The docker image name already includes the ECR prefix and tag, so we can just 2025-12-04T08:58:30.9113619Z  # use it as it is, but first let's extract the tag 2025-12-04T08:58:30.9113954Z  DOCKER_TAG=$(echo "${DOCKER_IMAGE_NAME}" | awk -F '[:,]' '{print $2}') 2025-12-04T08:58:30.9114328Z  echo "docker-tag=${DOCKER_TAG}" >> "${GITHUB_OUTPUT}" 2025-12-04T08:58:30.9114662Z  echo "docker-image=${DOCKER_IMAGE_NAME}" >> "${GITHUB_OUTPUT}" 2025-12-04T08:58:30.9114932Z else 2025-12-04T08:58:30.9115118Z  if [[ "${DOCKER_IMAGE_NAME}" == *:* ]]; then 2025-12-04T08:58:30.9115386Z  CUSTOM_TAG_PREFIX=${DOCKER_IMAGE_NAME#*:} 2025-12-04T08:58:30.9115652Z  DOCKER_IMAGE_NAME=${DOCKER_IMAGE_NAME%%:*} 2025-12-04T08:58:30.9115888Z  fi 2025-12-04T08:58:30.9116384Z  DOCKER_TAG=${CUSTOM_TAG_PREFIX:+${CUSTOM_TAG_PREFIX}-}$(git rev-parse HEAD:"${DOCKER_BUILD_DIR}") 2025-12-04T08:58:30.9116804Z  echo "docker-tag=${DOCKER_TAG}" >> "${GITHUB_OUTPUT}" 2025-12-04T08:58:30.9117237Z  echo "docker-image=${DOCKER_REGISTRY}/${REPO_NAME}/${DOCKER_IMAGE_NAME}:${DOCKER_TAG}" >> "${GITHUB_OUTPUT}" 2025-12-04T08:58:30.9117714Z  echo "custom-tag-prefix=${CUSTOM_TAG_PREFIX}" >> "${GITHUB_OUTPUT}" 2025-12-04T08:58:30.9118013Z fi 2025-12-04T08:58:30.9125300Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T08:58:30.9125589Z env: 2025-12-04T08:58:30.9125746Z GIT_DEFAULT_BRANCH: main 2025-12-04T08:58:30.9125936Z REPO_NAME: pytorch 2025-12-04T08:58:30.9126656Z DOCKER_IMAGE_NAME: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T08:58:30.9127305Z DOCKER_BUILD_DIR: .ci/docker 2025-12-04T08:58:30.9127515Z DOCKER_BUILD_SCRIPT: ./build.sh 2025-12-04T08:58:30.9127776Z DOCKER_REGISTRY: 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T08:58:30.9128060Z USE_CUSTOM_DOCKER_REGISTRY: true 2025-12-04T08:58:30.9128264Z CUSTOM_TAG_PREFIX: 2025-12-04T08:58:30.9128427Z ##[endgroup] 2025-12-04T08:58:30.9153602Z + [[ -d .ci/docker ]] 2025-12-04T08:58:30.9153896Z + [[ -f .ci/docker/./build.sh ]] 2025-12-04T08:58:30.9154165Z + [[ true == \t\r\u\e ]] 2025-12-04T08:58:30.9154387Z + echo skip=false 2025-12-04T08:58:30.9155379Z + [[ 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a == *\3\0\8\5\3\5\3\8\5\1\1\4\.\d\k\r\.\e\c\r\.\u\s\-\e\a\s\t\-\1\.\a\m\a\z\o\n\a\w\s\.\c\o\m\/\p\y\t\o\r\c\h* ]] 2025-12-04T08:58:30.9160813Z ++ echo 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T08:58:30.9161567Z ++ awk -F '[:,]' '{print $2}' 2025-12-04T08:58:30.9185561Z + DOCKER_TAG=pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T08:58:30.9186407Z + echo docker-tag=pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T08:58:30.9187460Z + echo docker-image=308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T08:58:30.9209252Z ##[group]Run set +e 2025-12-04T08:58:30.9209473Z set +e 2025-12-04T08:58:30.9209628Z set -x 2025-12-04T08:58:30.9209783Z  2025-12-04T08:58:30.9209932Z login() { 2025-12-04T08:58:30.9210277Z  aws ecr get-login-password --region us-east-1 | docker login -u AWS --password-stdin "$1" 2025-12-04T08:58:30.9210645Z } 2025-12-04T08:58:30.9210799Z  2025-12-04T08:58:30.9210937Z retry () { 2025-12-04T08:58:30.9211131Z  $* || (sleep 1 && $*) || (sleep 2 && $*) 2025-12-04T08:58:30.9211364Z } 2025-12-04T08:58:30.9211500Z  2025-12-04T08:58:30.9211664Z retry login "${DOCKER_REGISTRY}" 2025-12-04T08:58:30.9211876Z  2025-12-04T08:58:30.9212030Z START_TIME=$(date +%s) 2025-12-04T08:58:30.9212233Z # Wait up to 120 minutes 2025-12-04T08:58:30.9212604Z while [[ $(( $(date +%s) - 7200 )) -lt $START_TIME ]]; do 2025-12-04T08:58:30.9212953Z  # Check if image already exists, if it does then skip building it 2025-12-04T08:58:30.9213288Z  if docker manifest inspect "${DOCKER_IMAGE}"; then 2025-12-04T08:58:30.9213543Z  exit 0 2025-12-04T08:58:30.9213707Z  fi 2025-12-04T08:58:30.9213848Z  2025-12-04T08:58:30.9214114Z  # NB: This flag is used by Docker build workflow to push the image to ECR, so we can 2025-12-04T08:58:30.9214571Z  # use this to differentiate between the Docker build and regular build jobs. For the 2025-12-04T08:58:30.9215171Z  # latter, it will wait for the Docker images to become available before continuing 2025-12-04T08:58:30.9215523Z  if [ "${DOCKER_PUSH:-false}" == "true" ]; then 2025-12-04T08:58:30.9215803Z  # It's a Docker build job, let's build the image 2025-12-04T08:58:30.9216044Z  break 2025-12-04T08:58:30.9216203Z  else 2025-12-04T08:58:30.9216436Z  # It's a regular build job, wait for the image to become available 2025-12-04T08:58:30.9216729Z  sleep 300 2025-12-04T08:58:30.9216900Z  fi 2025-12-04T08:58:30.9217043Z done 2025-12-04T08:58:30.9217197Z  2025-12-04T08:58:30.9217441Z # NB: This part requires a full checkout. Otherwise, the merge base will 2025-12-04T08:58:30.9217962Z # be empty. The default action would be to continue rebuild the image 2025-12-04T08:58:30.9218324Z if [[ "$BASE_REVISION" = "$(git rev-parse HEAD)" ]]; then 2025-12-04T08:58:30.9218647Z  # if we're on the base branch then use the parent commit 2025-12-04T08:58:30.9218922Z  MERGE_BASE=$(git rev-parse HEAD~) 2025-12-04T08:58:30.9219131Z else 2025-12-04T08:58:30.9219356Z  # otherwise we're on a PR, so use the most recent base commit 2025-12-04T08:58:30.9219686Z  MERGE_BASE=$(git merge-base HEAD "$BASE_REVISION") 2025-12-04T08:58:30.9219929Z fi 2025-12-04T08:58:30.9220075Z  2025-12-04T08:58:30.9220250Z if [[ -z "${MERGE_BASE}" ]]; then 2025-12-04T08:58:30.9220498Z  echo "rebuild=true" >> "${GITHUB_OUTPUT}" 2025-12-04T08:58:30.9220721Z  2025-12-04T08:58:30.9221040Z  echo "Finding merge base only works with full checkout, please set fetch-depth to 0, continuing ..." 2025-12-04T08:58:30.9221425Z  exit 0 2025-12-04T08:58:30.9221583Z fi 2025-12-04T08:58:30.9221732Z  2025-12-04T08:58:30.9221957Z if ! git rev-parse "${MERGE_BASE}:${DOCKER_BUILD_DIR}"; then 2025-12-04T08:58:30.9222432Z  echo "Directory '${DOCKER_BUILD_DIR}' not found in commit $MERGE_BASE, you should rebase onto a more recent commit" 2025-12-04T08:58:30.9222829Z  exit 1 2025-12-04T08:58:30.9222987Z fi 2025-12-04T08:58:30.9223137Z  2025-12-04T08:58:30.9223386Z PREVIOUS_DOCKER_TAG=$(git rev-parse "${MERGE_BASE}:${DOCKER_BUILD_DIR}") 2025-12-04T08:58:30.9223841Z # If no image exists but the hash is the same as the previous hash then we should error out here 2025-12-04T08:58:30.9224252Z if [[ "${PREVIOUS_DOCKER_TAG}" == "${DOCKER_TAG}" ]]; then 2025-12-04T08:58:30.9224730Z  echo "WARNING: Something has gone wrong and the previous image isn't available for the merge-base of your branch" 2025-12-04T08:58:30.9225260Z  echo " Will re-build docker image to store in local cache, TTS may be longer" 2025-12-04T08:58:30.9225579Z fi 2025-12-04T08:58:30.9225740Z  2025-12-04T08:58:30.9225916Z echo "rebuild=true" >> "${GITHUB_OUTPUT}" 2025-12-04T08:58:30.9233003Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T08:58:30.9233287Z env: 2025-12-04T08:58:30.9233450Z GIT_DEFAULT_BRANCH: main 2025-12-04T08:58:30.9233646Z DOCKER_BUILD_DIR: .ci/docker 2025-12-04T08:58:30.9233895Z BASE_REVISION: ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T08:58:30.9234559Z DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T08:58:30.9235357Z DOCKER_TAG: pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T08:58:30.9235840Z DOCKER_REGISTRY: 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T08:58:30.9236117Z DOCKER_PUSH: 2025-12-04T08:58:30.9236275Z ##[endgroup] 2025-12-04T08:58:30.9261537Z + retry login 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T08:58:30.9262575Z + login 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T08:58:30.9264789Z + aws ecr get-login-password --region us-east-1 2025-12-04T08:58:30.9265878Z + docker login -u AWS --password-stdin 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T08:58:31.3881654Z WARNING! Your password will be stored unencrypted in /home/ec2-user/.docker/config.json. 2025-12-04T08:58:31.3882136Z Configure a credential helper to remove this warning. See 2025-12-04T08:58:31.3882558Z https://docs.docker.com/engine/reference/commandline/login/#credentials-store 2025-12-04T08:58:31.3882829Z 2025-12-04T08:58:31.3884454Z Login Succeeded 2025-12-04T08:58:31.3906068Z ++ date +%s 2025-12-04T08:58:31.3916942Z + START_TIME=1764838711 2025-12-04T08:58:31.3920341Z ++ date +%s 2025-12-04T08:58:31.3931544Z + [[ 1764831511 -lt 1764838711 ]] 2025-12-04T08:58:31.3932561Z + docker manifest inspect 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T08:58:31.5721320Z { 2025-12-04T08:58:31.5721605Z "schemaVersion": 2, 2025-12-04T08:58:31.5721942Z "mediaType": "application/vnd.docker.distribution.manifest.v2+json", 2025-12-04T08:58:31.5722263Z "config": { 2025-12-04T08:58:31.5722500Z "mediaType": "application/vnd.docker.container.image.v1+json", 2025-12-04T08:58:31.5722787Z "size": 34864, 2025-12-04T08:58:31.5723091Z "digest": "sha256:add7313791033822205cdb3cf32096534b2cfaa4855bd48119b59000bfe00301" 2025-12-04T08:58:31.5723423Z }, 2025-12-04T08:58:31.5723567Z "layers": [ 2025-12-04T08:58:31.5723712Z { 2025-12-04T08:58:31.5723954Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5724258Z "size": 30447951, 2025-12-04T08:58:31.5724578Z "digest": "sha256:63e5bc7682b85ae57a1221210f64d62e7a90b0a30f19af4ca734b8242ae49d63" 2025-12-04T08:58:31.5724911Z }, 2025-12-04T08:58:31.5725040Z { 2025-12-04T08:58:31.5725305Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5725616Z "size": 1554, 2025-12-04T08:58:31.5725896Z "digest": "sha256:0678d56345c994444b77bb70b1177189d23e794748b1d75ffc45d227c7dea94a" 2025-12-04T08:58:31.5726211Z }, 2025-12-04T08:58:31.5726342Z { 2025-12-04T08:58:31.5726567Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5726846Z "size": 313275661, 2025-12-04T08:58:31.5727152Z "digest": "sha256:45f5c9ddfce78349dff3d5edfbaa0310ae17311f66abdcd7e00fa21b500e801c" 2025-12-04T08:58:31.5727491Z }, 2025-12-04T08:58:31.5727635Z { 2025-12-04T08:58:31.5727859Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5728163Z "size": 787, 2025-12-04T08:58:31.5728449Z "digest": "sha256:086b1df51ac1162d9c45698e9dfaf91c6c222c8bd9ab01797ac8f9344bc8044f" 2025-12-04T08:58:31.5728773Z }, 2025-12-04T08:58:31.5728901Z { 2025-12-04T08:58:31.5729118Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5729407Z "size": 106, 2025-12-04T08:58:31.5729688Z "digest": "sha256:fe8a7b64bf98352f89057bcba66beef2fb44cc05fbd3606abccd8e86cf476234" 2025-12-04T08:58:31.5730024Z }, 2025-12-04T08:58:31.5730154Z { 2025-12-04T08:58:31.5730372Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5730649Z "size": 703, 2025-12-04T08:58:31.5730919Z "digest": "sha256:7680723e9a578033dd106b45784c639f06cc8adb1f5239ec513d9de01087c1af" 2025-12-04T08:58:31.5731224Z }, 2025-12-04T08:58:31.5731349Z { 2025-12-04T08:58:31.5731563Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5731847Z "size": 1216, 2025-12-04T08:58:31.5732243Z "digest": "sha256:9c5027aeeb4e3101f48c1d2e400c387110e1009e42497ee801f1b4b7f7edb5c0" 2025-12-04T08:58:31.5732686Z }, 2025-12-04T08:58:31.5732828Z { 2025-12-04T08:58:31.5733038Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5733565Z "size": 483, 2025-12-04T08:58:31.5733831Z "digest": "sha256:9a56521103600bd37a1e7c1191b5136c2d738c092f8a6701499f7068a32c2628" 2025-12-04T08:58:31.5734135Z }, 2025-12-04T08:58:31.5734273Z { 2025-12-04T08:58:31.5734493Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5734778Z "size": 110361875, 2025-12-04T08:58:31.5735057Z "digest": "sha256:375c4427e9141269458333b1463fdb219e736fd6231ec1c56c625c48437ace77" 2025-12-04T08:58:31.5735362Z }, 2025-12-04T08:58:31.5735481Z { 2025-12-04T08:58:31.5735697Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5735974Z "size": 4961, 2025-12-04T08:58:31.5736245Z "digest": "sha256:a86faaa7dbdd70e678e5ea20072637ee42618921ca8f80ca089f789325d4b0c2" 2025-12-04T08:58:31.5736561Z }, 2025-12-04T08:58:31.5736687Z { 2025-12-04T08:58:31.5737054Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5737336Z "size": 1755, 2025-12-04T08:58:31.5737611Z "digest": "sha256:fb7848686804957915d98f8655ef6da0fe4c521b50a82aefdebf475983505a15" 2025-12-04T08:58:31.5737923Z }, 2025-12-04T08:58:31.5738043Z { 2025-12-04T08:58:31.5738258Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5738535Z "size": 724, 2025-12-04T08:58:31.5738806Z "digest": "sha256:3541df015cdb7e8925273399d28e56c31b3c9196f00439ac2925537b173b1f84" 2025-12-04T08:58:31.5739120Z }, 2025-12-04T08:58:31.5739247Z { 2025-12-04T08:58:31.5739457Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5739736Z "size": 543, 2025-12-04T08:58:31.5740007Z "digest": "sha256:79dc80f426b29d4ae9157b967050b03e66aa0c4b1295b944a1dd70106be87066" 2025-12-04T08:58:31.5740322Z }, 2025-12-04T08:58:31.5740444Z { 2025-12-04T08:58:31.5740664Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5740949Z "size": 3185190117, 2025-12-04T08:58:31.5741243Z "digest": "sha256:a13fcc1b90bb9c251ebe7ef2a03c4cb3afa1c8bdafe84f5f85136773059a3735" 2025-12-04T08:58:31.5741577Z }, 2025-12-04T08:58:31.5741704Z { 2025-12-04T08:58:31.5741912Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5742197Z "size": 32, 2025-12-04T08:58:31.5742487Z "digest": "sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1" 2025-12-04T08:58:31.5743047Z }, 2025-12-04T08:58:31.5743174Z { 2025-12-04T08:58:31.5743406Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5743690Z "size": 396, 2025-12-04T08:58:31.5743961Z "digest": "sha256:549db4d6c618ecd9534658a233e3c90508f82d8735f965c2786b2eaa078869e5" 2025-12-04T08:58:31.5744286Z }, 2025-12-04T08:58:31.5744415Z { 2025-12-04T08:58:31.5744636Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5744929Z "size": 236860, 2025-12-04T08:58:31.5745216Z "digest": "sha256:5c63528cb580001e65104f4cb0809bf0673a00f989a7db42fd6d86aa1ec27cee" 2025-12-04T08:58:31.5745535Z }, 2025-12-04T08:58:31.5745665Z { 2025-12-04T08:58:31.5745888Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5746163Z "size": 231, 2025-12-04T08:58:31.5746446Z "digest": "sha256:75bd83b989a44e4d4119a3f972891025eb0e9ce95cfbe4a0ca5cdbe7130028d6" 2025-12-04T08:58:31.5746769Z }, 2025-12-04T08:58:31.5746897Z { 2025-12-04T08:58:31.5747109Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5747405Z "size": 3043497, 2025-12-04T08:58:31.5747691Z "digest": "sha256:de6e78970f517178cb91f36cd02bd9ca7b72a08fb82a0f9007516026f258c035" 2025-12-04T08:58:31.5748003Z }, 2025-12-04T08:58:31.5748130Z { 2025-12-04T08:58:31.5748356Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5748631Z "size": 1472, 2025-12-04T08:58:31.5748923Z "digest": "sha256:e13ed7c7e4736e81dc21af755b3363eb26e4d3b2f1ca988dfe65effa47d8fa42" 2025-12-04T08:58:31.5749341Z }, 2025-12-04T08:58:31.5749474Z { 2025-12-04T08:58:31.5749695Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5749973Z "size": 481, 2025-12-04T08:58:31.5750247Z "digest": "sha256:6e2949bcb74152577a0f20c38bcb6dd80f5e68427e3e531a80e08c9ecc73a979" 2025-12-04T08:58:31.5750568Z }, 2025-12-04T08:58:31.5750706Z { 2025-12-04T08:58:31.5750926Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5751198Z "size": 202, 2025-12-04T08:58:31.5751476Z "digest": "sha256:14d69d9aaec70287efd2fd35c4f93e43a29a4098458cc9fca1c93f02ad7356cb" 2025-12-04T08:58:31.5751799Z }, 2025-12-04T08:58:31.5751923Z { 2025-12-04T08:58:31.5752136Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5752419Z "size": 607, 2025-12-04T08:58:31.5752829Z "digest": "sha256:5c02769dd8e5bba2f7f5fd84bde9595fcb3bdbffcae497503fa846f9b5e78bf5" 2025-12-04T08:58:31.5753176Z }, 2025-12-04T08:58:31.5753309Z { 2025-12-04T08:58:31.5753518Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5753806Z "size": 7889619584, 2025-12-04T08:58:31.5754108Z "digest": "sha256:35041ce524ac4afec40ecd73b1393c830614f1f79d43a6439767a6c7d5b7027b" 2025-12-04T08:58:31.5754433Z }, 2025-12-04T08:58:31.5754558Z { 2025-12-04T08:58:31.5754783Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5755070Z "size": 830, 2025-12-04T08:58:31.5755345Z "digest": "sha256:2fa92dc5885e080e049ceb4139288b6c0e39fab34256945708b08ea55a1f7a0b" 2025-12-04T08:58:31.5755666Z }, 2025-12-04T08:58:31.5755792Z { 2025-12-04T08:58:31.5756004Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5756285Z "size": 33451739, 2025-12-04T08:58:31.5756576Z "digest": "sha256:2b85eafbd92a0e70a0a70154ad8bf4584095e576d95873368f30373f5966714a" 2025-12-04T08:58:31.5756888Z }, 2025-12-04T08:58:31.5757013Z { 2025-12-04T08:58:31.5757232Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5757524Z "size": 104, 2025-12-04T08:58:31.5757809Z "digest": "sha256:ff755a4ddad7880f23c6b767d432d6f1eafdb62b3ea18f8a98e22c441c099fcb" 2025-12-04T08:58:31.5758131Z }, 2025-12-04T08:58:31.5758256Z { 2025-12-04T08:58:31.5758463Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5758742Z "size": 1496, 2025-12-04T08:58:31.5759015Z "digest": "sha256:09eb41bdf42d8605b57b2363348154140904dec914b34a67298b82122bfce2b3" 2025-12-04T08:58:31.5759316Z }, 2025-12-04T08:58:31.5759440Z { 2025-12-04T08:58:31.5759654Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5759925Z "size": 458787828, 2025-12-04T08:58:31.5760210Z "digest": "sha256:11ede4d59e935e62f41b33220fe871794ab5e57ce724173b713368977683bcf6" 2025-12-04T08:58:31.5760524Z }, 2025-12-04T08:58:31.5760645Z { 2025-12-04T08:58:31.5760860Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5761145Z "size": 164, 2025-12-04T08:58:31.5761416Z "digest": "sha256:1283cd8f801a142172f3ab76fd472df8583223d9437de3e4d18d8cf98ea3fa98" 2025-12-04T08:58:31.5761725Z }, 2025-12-04T08:58:31.5762165Z { 2025-12-04T08:58:31.5762410Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5762701Z "size": 346, 2025-12-04T08:58:31.5762980Z "digest": "sha256:024fa855425fa524ad4500660cf61d53be62b99556d31b8b280d14caba434a35" 2025-12-04T08:58:31.5763294Z }, 2025-12-04T08:58:31.5763415Z { 2025-12-04T08:58:31.5763635Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5763928Z "size": 32, 2025-12-04T08:58:31.5764208Z "digest": "sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1" 2025-12-04T08:58:31.5764557Z }, 2025-12-04T08:58:31.5764697Z { 2025-12-04T08:58:31.5764920Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5765359Z "size": 106, 2025-12-04T08:58:31.5765657Z "digest": "sha256:303e6747a62efecf5efa1f97d0e66b40a3b39da8d79a51f75b89f4c92ae7ec52" 2025-12-04T08:58:31.5765982Z }, 2025-12-04T08:58:31.5766107Z { 2025-12-04T08:58:31.5766326Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5766607Z "size": 424, 2025-12-04T08:58:31.5766882Z "digest": "sha256:3017cdf4838bcc9a33daebc07487f8ae1f6bd6e7ce8322c14f5480e8db9ef90e" 2025-12-04T08:58:31.5767207Z }, 2025-12-04T08:58:31.5767333Z { 2025-12-04T08:58:31.5767551Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5767830Z "size": 19309374, 2025-12-04T08:58:31.5768135Z "digest": "sha256:6b6cd1c358e886dc6ed7fd46ac4bcc1a0a73b7b1301739ea1953478ee5d83f50" 2025-12-04T08:58:31.5768453Z }, 2025-12-04T08:58:31.5768592Z { 2025-12-04T08:58:31.5768919Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5769195Z "size": 108, 2025-12-04T08:58:31.5769475Z "digest": "sha256:b2dd045011241d1cf8889e2a7369d9fe4844dfe15529b520ccd6a59bd3c1532e" 2025-12-04T08:58:31.5769787Z }, 2025-12-04T08:58:31.5769913Z { 2025-12-04T08:58:31.5770120Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5770409Z "size": 827, 2025-12-04T08:58:31.5770685Z "digest": "sha256:55adc51fe5897031d4cf2f2b8fd162213f6e46a52848630c616606271b97952e" 2025-12-04T08:58:31.5770994Z }, 2025-12-04T08:58:31.5771121Z { 2025-12-04T08:58:31.5771339Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5771612Z "size": 724, 2025-12-04T08:58:31.5771880Z "digest": "sha256:3541df015cdb7e8925273399d28e56c31b3c9196f00439ac2925537b173b1f84" 2025-12-04T08:58:31.5772189Z }, 2025-12-04T08:58:31.5772382Z { 2025-12-04T08:58:31.5772604Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5772889Z "size": 149, 2025-12-04T08:58:31.5773158Z "digest": "sha256:a43ca0e4b837964b12b7469194cfe939c26de027298040028975324dce25938a" 2025-12-04T08:58:31.5773466Z }, 2025-12-04T08:58:31.5773592Z { 2025-12-04T08:58:31.5773807Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5774079Z "size": 138, 2025-12-04T08:58:31.5774351Z "digest": "sha256:b7212f17fd1404837fcfdd086dd0e2667931e4db377d45d8d89a44390c84e11d" 2025-12-04T08:58:31.5774679Z }, 2025-12-04T08:58:31.5774800Z { 2025-12-04T08:58:31.5775016Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5775298Z "size": 141, 2025-12-04T08:58:31.5775564Z "digest": "sha256:083e42cac090e6486c35f392b64ee54448f5e4aa947003aeb3e1f92c8ea5c099" 2025-12-04T08:58:31.5775896Z }, 2025-12-04T08:58:31.5776043Z { 2025-12-04T08:58:31.5776272Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5776567Z "size": 32, 2025-12-04T08:58:31.5776862Z "digest": "sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1" 2025-12-04T08:58:31.5777198Z }, 2025-12-04T08:58:31.5777321Z { 2025-12-04T08:58:31.5777546Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5777843Z "size": 223, 2025-12-04T08:58:31.5778125Z "digest": "sha256:0a00b784a4aac341795729b254f7edd09e811b7f51d0c58e0e6bfeeee6940503" 2025-12-04T08:58:31.5778449Z }, 2025-12-04T08:58:31.5778577Z { 2025-12-04T08:58:31.5778795Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5779076Z "size": 255, 2025-12-04T08:58:31.5779360Z "digest": "sha256:c6173c779f7ba143a21214ea5f032b141863a37ceb4c0ac01d3248c216ce5241" 2025-12-04T08:58:31.5779674Z }, 2025-12-04T08:58:31.5779799Z { 2025-12-04T08:58:31.5780013Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5780291Z "size": 145520672, 2025-12-04T08:58:31.5780584Z "digest": "sha256:ed3d1e3387b924585c332bf1bc252fa159cd0d25256a874043ff0141b1ab5ff7" 2025-12-04T08:58:31.5780903Z }, 2025-12-04T08:58:31.5781113Z { 2025-12-04T08:58:31.5781323Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5781600Z "size": 106, 2025-12-04T08:58:31.5781865Z "digest": "sha256:b29343478586aeee19d2a622661716f6f1591280c890f49b727a8da13a610784" 2025-12-04T08:58:31.5782167Z }, 2025-12-04T08:58:31.5782294Z { 2025-12-04T08:58:31.5782519Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5782799Z "size": 312293530, 2025-12-04T08:58:31.5783085Z "digest": "sha256:c6f0520487fb506bc4601fd84d5f28d8a76b203e004731e4b2067c2ab1a14e0b" 2025-12-04T08:58:31.5783404Z }, 2025-12-04T08:58:31.5783526Z { 2025-12-04T08:58:31.5783796Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5784263Z "size": 3058011133, 2025-12-04T08:58:31.5814035Z "digest": "sha256:148171691cd4c4d20310d490d4b4dd903490d04ea07fb8f7e668a28768683e9a" 2025-12-04T08:58:31.5814443Z }, 2025-12-04T08:58:31.5814582Z { 2025-12-04T08:58:31.5814817Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5815111Z "size": 129, 2025-12-04T08:58:31.5815407Z "digest": "sha256:2c666d30ed77fff9ff1167d41cd645dad98280fcbe941f5bc3828c7ae66b1287" 2025-12-04T08:58:31.5815737Z }, 2025-12-04T08:58:31.5815861Z { 2025-12-04T08:58:31.5816101Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5816385Z "size": 880, 2025-12-04T08:58:31.5816661Z "digest": "sha256:5d8d3a0a98e012c5068e0f3bae5a03e3148ecf2d063634eee4c9241a1e3fdfb5" 2025-12-04T08:58:31.5816986Z }, 2025-12-04T08:58:31.5817113Z { 2025-12-04T08:58:31.5817339Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5817627Z "size": 724, 2025-12-04T08:58:31.5817903Z "digest": "sha256:3541df015cdb7e8925273399d28e56c31b3c9196f00439ac2925537b173b1f84" 2025-12-04T08:58:31.5818210Z }, 2025-12-04T08:58:31.5818331Z { 2025-12-04T08:58:31.5818551Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5818834Z "size": 139, 2025-12-04T08:58:31.5819101Z "digest": "sha256:b06bafce9e817295d8127207747c80aa18e04392ff0875844fc30a1e794a8a0c" 2025-12-04T08:58:31.5819427Z }, 2025-12-04T08:58:31.5819556Z { 2025-12-04T08:58:31.5819786Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5820074Z "size": 32, 2025-12-04T08:58:31.5820362Z "digest": "sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1" 2025-12-04T08:58:31.5820684Z }, 2025-12-04T08:58:31.5820813Z { 2025-12-04T08:58:31.5821036Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5821317Z "size": 159, 2025-12-04T08:58:31.5821598Z "digest": "sha256:15e0d7e4590d3d8f598d05aec3a92f891bf8b4605bcc38cc2de852b6014ef8f3" 2025-12-04T08:58:31.5821924Z }, 2025-12-04T08:58:31.5822056Z { 2025-12-04T08:58:31.5822275Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5822557Z "size": 1011, 2025-12-04T08:58:31.5822843Z "digest": "sha256:a514bd1add3164d8d7ca99aa19294c4ed8b97b074635d98714c4f598a959f4cd" 2025-12-04T08:58:31.5823159Z }, 2025-12-04T08:58:31.5823299Z { 2025-12-04T08:58:31.5823530Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5823813Z "size": 724, 2025-12-04T08:58:31.5824095Z "digest": "sha256:3541df015cdb7e8925273399d28e56c31b3c9196f00439ac2925537b173b1f84" 2025-12-04T08:58:31.5824417Z }, 2025-12-04T08:58:31.5824543Z { 2025-12-04T08:58:31.5824770Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5825054Z "size": 134, 2025-12-04T08:58:31.5825339Z "digest": "sha256:57b84ee6000204f27a1d9bca199b19be4c86ecd324540dbdf239c56a6c3b34ea" 2025-12-04T08:58:31.5825659Z }, 2025-12-04T08:58:31.5825792Z { 2025-12-04T08:58:31.5826020Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5826300Z "size": 32, 2025-12-04T08:58:31.5826673Z "digest": "sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1" 2025-12-04T08:58:31.5827011Z }, 2025-12-04T08:58:31.5827139Z { 2025-12-04T08:58:31.5827368Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5827646Z "size": 157, 2025-12-04T08:58:31.5827929Z "digest": "sha256:b8babeff6d817a5961dddc15c6bdfdbd05da187fae75d5804015f99fd7c066d8" 2025-12-04T08:58:31.5828262Z }, 2025-12-04T08:58:31.5828389Z { 2025-12-04T08:58:31.5828602Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5828878Z "size": 602, 2025-12-04T08:58:31.5829158Z "digest": "sha256:83779ddf6a85ab387f64a45f274cba245b69e4fd1931ff0b5d7d3efd4b7a43bc" 2025-12-04T08:58:31.5829475Z }, 2025-12-04T08:58:31.5829596Z { 2025-12-04T08:58:31.5829900Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5830179Z "size": 724, 2025-12-04T08:58:31.5830439Z "digest": "sha256:3541df015cdb7e8925273399d28e56c31b3c9196f00439ac2925537b173b1f84" 2025-12-04T08:58:31.5830758Z }, 2025-12-04T08:58:31.5830883Z { 2025-12-04T08:58:31.5831094Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5831373Z "size": 155, 2025-12-04T08:58:31.5831649Z "digest": "sha256:8b7620c0d736cc79381207ce5afe2af90f0cd7f0cd394577d2c9520d7f74762f" 2025-12-04T08:58:31.5831962Z }, 2025-12-04T08:58:31.5832103Z { 2025-12-04T08:58:31.5832322Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5832600Z "size": 32, 2025-12-04T08:58:31.5832874Z "digest": "sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1" 2025-12-04T08:58:31.5833193Z }, 2025-12-04T08:58:31.5833325Z { 2025-12-04T08:58:31.5833548Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5833839Z "size": 188, 2025-12-04T08:58:31.5834122Z "digest": "sha256:3bcfa090e4efd3677425f76baea9f1e0c50a75d8c6b5713ec05310f1dff24539" 2025-12-04T08:58:31.5834444Z }, 2025-12-04T08:58:31.5834574Z { 2025-12-04T08:58:31.5834793Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5835065Z "size": 1370, 2025-12-04T08:58:31.5835353Z "digest": "sha256:eb0504ec4d9218a79896b604f73dc0ea5a0f96266ad9c2cdbbbe5f0f18222694" 2025-12-04T08:58:31.5835680Z }, 2025-12-04T08:58:31.5835814Z { 2025-12-04T08:58:31.5836042Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5836325Z "size": 32, 2025-12-04T08:58:31.5836603Z "digest": "sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1" 2025-12-04T08:58:31.5836918Z }, 2025-12-04T08:58:31.5837053Z { 2025-12-04T08:58:31.5837275Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5837559Z "size": 136, 2025-12-04T08:58:31.5837843Z "digest": "sha256:15d0fec09d7b196a1462d51516ee90fc3443ba178d3e56d59cacf32146b4321d" 2025-12-04T08:58:31.5838172Z }, 2025-12-04T08:58:31.5838297Z { 2025-12-04T08:58:31.5838518Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5838799Z "size": 528, 2025-12-04T08:58:31.5839074Z "digest": "sha256:cca81fcc62a949959ca4dd3c9056fb293d548ef8607127eeeef6cfd3a8897ca8" 2025-12-04T08:58:31.5839401Z }, 2025-12-04T08:58:31.5839529Z { 2025-12-04T08:58:31.5839742Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5840020Z "size": 32, 2025-12-04T08:58:31.5840297Z "digest": "sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1" 2025-12-04T08:58:31.5840618Z }, 2025-12-04T08:58:31.5840740Z { 2025-12-04T08:58:31.5840956Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5841233Z "size": 104, 2025-12-04T08:58:31.5841514Z "digest": "sha256:b0b8f9b5c6ab98db9cd830dc584e1b6aec9add139e4cc48d8c243d36691e25b4" 2025-12-04T08:58:31.5841844Z }, 2025-12-04T08:58:31.5842087Z { 2025-12-04T08:58:31.5842303Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5842583Z "size": 435, 2025-12-04T08:58:31.5842857Z "digest": "sha256:0606ca4d47a8a70e91e92b03ca51a85e731641b09342136a54ef2f2a6d9dfb44" 2025-12-04T08:58:31.5843168Z }, 2025-12-04T08:58:31.5843309Z { 2025-12-04T08:58:31.5843530Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5843812Z "size": 32, 2025-12-04T08:58:31.5844082Z "digest": "sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1" 2025-12-04T08:58:31.5844404Z }, 2025-12-04T08:58:31.5844531Z { 2025-12-04T08:58:31.5844742Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5845019Z "size": 109, 2025-12-04T08:58:31.5845378Z "digest": "sha256:2f80a4e1b3b95ed67bb781ea787e8a63e46de79117d9d8e65c257072b38afa2d" 2025-12-04T08:58:31.5845696Z }, 2025-12-04T08:58:31.5845834Z { 2025-12-04T08:58:31.5846055Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5846330Z "size": 1896, 2025-12-04T08:58:31.5846631Z "digest": "sha256:35c916fb1bd057e517dcab78c3a2a018e68096d8993892ad84f47562d37ae352" 2025-12-04T08:58:31.5846952Z }, 2025-12-04T08:58:31.5847075Z { 2025-12-04T08:58:31.5847298Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5847590Z "size": 197526165, 2025-12-04T08:58:31.5847879Z "digest": "sha256:195537b7dafc96192f768323b1a8cc2a914d41959849b73198579576b0872a44" 2025-12-04T08:58:31.5848186Z }, 2025-12-04T08:58:31.5848316Z { 2025-12-04T08:58:31.5848536Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5848811Z "size": 106, 2025-12-04T08:58:31.5849093Z "digest": "sha256:dc454fd3967e5735b2498b7f1d958a2c626987d5e4ce225ca98da3cd945b59f3" 2025-12-04T08:58:31.5849428Z }, 2025-12-04T08:58:31.5849555Z { 2025-12-04T08:58:31.5849792Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5850091Z "size": 165, 2025-12-04T08:58:31.5850367Z "digest": "sha256:701b34f115fa897181c046dc37288e87cbc3ad74c36a9e2224b5bfe7c5703afb" 2025-12-04T08:58:31.5850691Z }, 2025-12-04T08:58:31.5850822Z { 2025-12-04T08:58:31.5851040Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5851323Z "size": 7944, 2025-12-04T08:58:31.5851609Z "digest": "sha256:39cefc00ffedebc9098261c798408b87a20c95a88fccb110594077f48dadf760" 2025-12-04T08:58:31.5851940Z }, 2025-12-04T08:58:31.5852063Z { 2025-12-04T08:58:31.5852359Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5852642Z "size": 8071, 2025-12-04T08:58:31.5852914Z "digest": "sha256:6ae51eb61a325b2c2995a5088c81aa20821b75be65b5aa722c7c40556b5d03ea" 2025-12-04T08:58:31.5853231Z }, 2025-12-04T08:58:31.5853372Z { 2025-12-04T08:58:31.5853588Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5853875Z "size": 304, 2025-12-04T08:58:31.5854156Z "digest": "sha256:1fd5341e66dfc0c1ae23af014641a92a6fd02640c528fe6d4dc55921ed659a26" 2025-12-04T08:58:31.5854471Z }, 2025-12-04T08:58:31.5854597Z { 2025-12-04T08:58:31.5854816Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5855095Z "size": 13364291, 2025-12-04T08:58:31.5855380Z "digest": "sha256:72a7c87e35e40ab796f90aee1b51add7902f0cdc44406d2505b6c6a1f55a8da6" 2025-12-04T08:58:31.5855700Z }, 2025-12-04T08:58:31.5855827Z { 2025-12-04T08:58:31.5856039Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5856318Z "size": 108, 2025-12-04T08:58:31.5856603Z "digest": "sha256:ec36862ac98ebaac52ee1a8b1d162d45bd0e3bf59ae7e19c8f80ad3960b4c600" 2025-12-04T08:58:31.5856925Z }, 2025-12-04T08:58:31.5857054Z { 2025-12-04T08:58:31.5857282Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5857557Z "size": 54145699, 2025-12-04T08:58:31.5857941Z "digest": "sha256:05ddbf246e8add0e293474dbf88bb028d5a295a25ac59e8648a18db644377773" 2025-12-04T08:58:31.5858265Z }, 2025-12-04T08:58:31.5858387Z { 2025-12-04T08:58:31.5858608Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:58:31.5858887Z "size": 32, 2025-12-04T08:58:31.5859173Z "digest": "sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1" 2025-12-04T08:58:31.5859491Z } 2025-12-04T08:58:31.5859620Z ] 2025-12-04T08:58:31.5859764Z } 2025-12-04T08:58:31.5859908Z + exit 0 2025-12-04T08:58:31.5883645Z ##[group]Run set -eux 2025-12-04T08:58:31.5883862Z set -eux 2025-12-04T08:58:31.5884168Z # It's ok if this steps fails, it would then be an anonymous user like what we used to have 2025-12-04T08:58:31.5885107Z aws secretsmanager get-secret-value --secret-id docker_hub_readonly_token | jq --raw-output '.SecretString' | jq -r .docker_hub_readonly_token | docker login --username pytorchbot --password-stdin || true 2025-12-04T08:58:31.5893735Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T08:58:31.5894023Z env: 2025-12-04T08:58:31.5894176Z GIT_DEFAULT_BRANCH: main 2025-12-04T08:58:31.5894360Z ##[endgroup] 2025-12-04T08:58:31.5923500Z + aws secretsmanager get-secret-value --secret-id docker_hub_readonly_token 2025-12-04T08:58:31.5924480Z + jq --raw-output .SecretString 2025-12-04T08:58:31.5925509Z + jq -r .docker_hub_readonly_token 2025-12-04T08:58:31.5926652Z + docker login --username pytorchbot --password-stdin 2025-12-04T08:58:32.0896953Z WARNING! Your password will be stored unencrypted in /home/ec2-user/.docker/config.json. 2025-12-04T08:58:32.0898444Z Configure a credential helper to remove this warning. See 2025-12-04T08:58:32.0898958Z https://docs.docker.com/engine/reference/commandline/login/#credentials-store 2025-12-04T08:58:32.0899352Z 2025-12-04T08:58:32.0899940Z Login Succeeded 2025-12-04T08:58:32.0981256Z ##[group]Run tag=${ECR_DOCKER_IMAGE##*:} 2025-12-04T08:58:32.0981592Z tag=${ECR_DOCKER_IMAGE##*:} 2025-12-04T08:58:32.0981888Z echo "docker pull ghcr.io/pytorch/ci-image:${tag/:/-}" 2025-12-04T08:58:32.0989639Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T08:58:32.0989917Z env: 2025-12-04T08:58:32.0990076Z GIT_DEFAULT_BRANCH: main 2025-12-04T08:58:32.0990682Z ECR_DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T08:58:32.0991293Z ##[endgroup] 2025-12-04T08:58:32.1020279Z docker pull ghcr.io/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T08:58:32.1058640Z ##[group]Run pytorch/test-infra/.github/actions/pull-docker-image@main 2025-12-04T08:58:32.1058990Z with: 2025-12-04T08:58:32.1059562Z docker-image: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T08:58:32.1060271Z docker-registry: 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T08:58:32.1060542Z env: 2025-12-04T08:58:32.1060688Z GIT_DEFAULT_BRANCH: main 2025-12-04T08:58:32.1060876Z ##[endgroup] 2025-12-04T08:58:32.1074392Z ##[group]Run set -x 2025-12-04T08:58:32.1074583Z set -x 2025-12-04T08:58:32.1074746Z set +e 2025-12-04T08:58:32.1074899Z  2025-12-04T08:58:32.1075046Z login() { 2025-12-04T08:58:32.1075376Z  aws ecr get-login-password --region us-east-1 | docker login -u AWS --password-stdin "$1" 2025-12-04T08:58:32.1075750Z } 2025-12-04T08:58:32.1075896Z  2025-12-04T08:58:32.1076067Z retry () { 2025-12-04T08:58:32.1076265Z  $* || (sleep 1 && $*) || (sleep 2 && $*) 2025-12-04T08:58:32.1076485Z } 2025-12-04T08:58:32.1076631Z  2025-12-04T08:58:32.1076792Z retry login "${DOCKER_REGISTRY}" 2025-12-04T08:58:32.1076998Z  2025-12-04T08:58:32.1077524Z IMAGE_SIZE=$(docker manifest inspect "${DOCKER_IMAGE}" | jq '[.layers[].size, .config.size] | add / 1024 / 1024') 2025-12-04T08:58:32.1077985Z echo "Compressed size of image in MB: ${IMAGE_SIZE}" 2025-12-04T08:58:32.1078235Z  2025-12-04T08:58:32.1078383Z set -e 2025-12-04T08:58:32.1078629Z # ignore output since only exit code is used for conditional 2025-12-04T08:58:32.1078973Z # only pull docker image if it's not available locally 2025-12-04T08:58:32.1079350Z if ! docker inspect --type=image "${DOCKER_IMAGE}" >/dev/null 2>/dev/null; then 2025-12-04T08:58:32.1079704Z  retry docker pull "${DOCKER_IMAGE}" 2025-12-04T08:58:32.1079943Z fi 2025-12-04T08:58:32.1086777Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T08:58:32.1087071Z env: 2025-12-04T08:58:32.1087241Z GIT_DEFAULT_BRANCH: main 2025-12-04T08:58:32.1087848Z DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T08:58:32.1088537Z DOCKER_REGISTRY: 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T08:58:32.1088816Z ##[endgroup] 2025-12-04T08:58:32.1112764Z + set +e 2025-12-04T08:58:32.1113160Z + retry login 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T08:58:32.1113590Z + login 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T08:58:32.1116219Z + aws ecr get-login-password --region us-east-1 2025-12-04T08:58:32.1117468Z + docker login -u AWS --password-stdin 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T08:58:32.5685547Z WARNING! Your password will be stored unencrypted in /home/ec2-user/.docker/config.json. 2025-12-04T08:58:32.5686135Z Configure a credential helper to remove this warning. See 2025-12-04T08:58:32.5686644Z https://docs.docker.com/engine/reference/commandline/login/#credentials-store 2025-12-04T08:58:32.5686985Z 2025-12-04T08:58:32.5687523Z Login Succeeded 2025-12-04T08:58:32.5710349Z ++ docker manifest inspect 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T08:58:32.5711277Z ++ jq '[.layers[].size, .config.size] | add / 1024 / 1024' 2025-12-04T08:58:32.7911297Z + IMAGE_SIZE=15091.581844329834 2025-12-04T08:58:32.7911888Z + echo 'Compressed size of image in MB: 15091.581844329834' 2025-12-04T08:58:32.7912236Z + set -e 2025-12-04T08:58:32.7912481Z Compressed size of image in MB: 15091.581844329834 2025-12-04T08:58:32.7913712Z + docker inspect --type=image 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T08:58:32.8030354Z + retry docker pull 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T08:58:32.8031436Z + docker pull 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T08:58:32.9909474Z pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a: Pulling from pytorch/ci-image 2025-12-04T08:58:32.9910738Z 63e5bc7682b8: Pulling fs layer 2025-12-04T08:58:32.9911203Z 0678d56345c9: Pulling fs layer 2025-12-04T08:58:32.9911622Z 45f5c9ddfce7: Pulling fs layer 2025-12-04T08:58:32.9912057Z 086b1df51ac1: Pulling fs layer 2025-12-04T08:58:32.9912485Z fe8a7b64bf98: Pulling fs layer 2025-12-04T08:58:32.9912904Z 7680723e9a57: Pulling fs layer 2025-12-04T08:58:32.9913367Z 9c5027aeeb4e: Pulling fs layer 2025-12-04T08:58:32.9913790Z 9a5652110360: Pulling fs layer 2025-12-04T08:58:32.9914216Z 375c4427e914: Pulling fs layer 2025-12-04T08:58:32.9914623Z 086b1df51ac1: Waiting 2025-12-04T08:58:32.9915016Z a86faaa7dbdd: Pulling fs layer 2025-12-04T08:58:32.9915432Z fb7848686804: Pulling fs layer 2025-12-04T08:58:32.9916166Z fe8a7b64bf98: Waiting 2025-12-04T08:58:32.9916566Z 3541df015cdb: Pulling fs layer 2025-12-04T08:58:32.9916990Z 7680723e9a57: Waiting 2025-12-04T08:58:32.9917333Z 9c5027aeeb4e: Waiting 2025-12-04T08:58:32.9917706Z 79dc80f426b2: Pulling fs layer 2025-12-04T08:58:32.9918106Z a13fcc1b90bb: Pulling fs layer 2025-12-04T08:58:32.9918483Z 4f4fb700ef54: Pulling fs layer 2025-12-04T08:58:32.9918803Z 375c4427e914: Waiting 2025-12-04T08:58:32.9919081Z 9a5652110360: Waiting 2025-12-04T08:58:32.9919355Z a86faaa7dbdd: Waiting 2025-12-04T08:58:32.9919662Z 549db4d6c618: Pulling fs layer 2025-12-04T08:58:32.9919986Z 79dc80f426b2: Waiting 2025-12-04T08:58:32.9920266Z 5c63528cb580: Pulling fs layer 2025-12-04T08:58:32.9920548Z a13fcc1b90bb: Waiting 2025-12-04T08:58:32.9920721Z 75bd83b989a4: Pulling fs layer 2025-12-04T08:58:32.9920912Z 3541df015cdb: Waiting 2025-12-04T08:58:32.9921078Z fb7848686804: Waiting 2025-12-04T08:58:32.9921240Z 4f4fb700ef54: Waiting 2025-12-04T08:58:32.9921401Z 5c63528cb580: Waiting 2025-12-04T08:58:32.9921573Z 549db4d6c618: Waiting 2025-12-04T08:58:32.9921771Z de6e78970f51: Pulling fs layer 2025-12-04T08:58:32.9921968Z e13ed7c7e473: Pulling fs layer 2025-12-04T08:58:32.9922147Z de6e78970f51: Waiting 2025-12-04T08:58:32.9922315Z 75bd83b989a4: Waiting 2025-12-04T08:58:32.9922494Z 6e2949bcb741: Pulling fs layer 2025-12-04T08:58:32.9922690Z 14d69d9aaec7: Pulling fs layer 2025-12-04T08:58:32.9922875Z 5c02769dd8e5: Pulling fs layer 2025-12-04T08:58:32.9923065Z 35041ce524ac: Pulling fs layer 2025-12-04T08:58:32.9923260Z 2fa92dc5885e: Pulling fs layer 2025-12-04T08:58:32.9923439Z e13ed7c7e473: Waiting 2025-12-04T08:58:32.9923599Z 5c02769dd8e5: Waiting 2025-12-04T08:58:32.9923770Z 2b85eafbd92a: Pulling fs layer 2025-12-04T08:58:32.9923964Z ff755a4ddad7: Pulling fs layer 2025-12-04T08:58:32.9924164Z 09eb41bdf42d: Pulling fs layer 2025-12-04T08:58:32.9924349Z 11ede4d59e93: Pulling fs layer 2025-12-04T08:58:32.9924673Z 2b85eafbd92a: Waiting 2025-12-04T08:58:32.9924908Z 2fa92dc5885e: Waiting 2025-12-04T08:58:32.9925078Z 09eb41bdf42d: Waiting 2025-12-04T08:58:32.9925229Z ff755a4ddad7: Waiting 2025-12-04T08:58:32.9925385Z 14d69d9aaec7: Waiting 2025-12-04T08:58:32.9925553Z 1283cd8f801a: Pulling fs layer 2025-12-04T08:58:32.9925736Z 024fa855425f: Pulling fs layer 2025-12-04T08:58:32.9925915Z 6e2949bcb741: Waiting 2025-12-04T08:58:32.9926073Z 11ede4d59e93: Waiting 2025-12-04T08:58:32.9926230Z 303e6747a62e: Pulling fs layer 2025-12-04T08:58:32.9926408Z 1283cd8f801a: Waiting 2025-12-04T08:58:32.9926565Z 024fa855425f: Waiting 2025-12-04T08:58:32.9926729Z 3017cdf4838b: Pulling fs layer 2025-12-04T08:58:32.9926902Z 35041ce524ac: Waiting 2025-12-04T08:58:32.9927067Z 6b6cd1c358e8: Pulling fs layer 2025-12-04T08:58:32.9927451Z 303e6747a62e: Waiting 2025-12-04T08:58:32.9927634Z b2dd04501124: Pulling fs layer 2025-12-04T08:58:32.9927828Z 55adc51fe589: Pulling fs layer 2025-12-04T08:58:32.9928008Z b2dd04501124: Waiting 2025-12-04T08:58:32.9928160Z 55adc51fe589: Waiting 2025-12-04T08:58:32.9928326Z a43ca0e4b837: Pulling fs layer 2025-12-04T08:58:32.9928521Z b7212f17fd14: Pulling fs layer 2025-12-04T08:58:32.9928693Z a43ca0e4b837: Waiting 2025-12-04T08:58:32.9928852Z 6b6cd1c358e8: Waiting 2025-12-04T08:58:32.9929020Z 3017cdf4838b: Waiting 2025-12-04T08:58:32.9929181Z 083e42cac090: Pulling fs layer 2025-12-04T08:58:32.9929368Z 0a00b784a4aa: Pulling fs layer 2025-12-04T08:58:32.9929555Z c6173c779f7b: Pulling fs layer 2025-12-04T08:58:32.9929725Z b7212f17fd14: Waiting 2025-12-04T08:58:32.9929889Z ed3d1e3387b9: Pulling fs layer 2025-12-04T08:58:32.9930075Z b29343478586: Pulling fs layer 2025-12-04T08:58:32.9930254Z 083e42cac090: Waiting 2025-12-04T08:58:32.9930409Z 0a00b784a4aa: Waiting 2025-12-04T08:58:32.9930572Z c6173c779f7b: Waiting 2025-12-04T08:58:32.9930741Z c6f0520487fb: Pulling fs layer 2025-12-04T08:58:32.9930915Z ed3d1e3387b9: Waiting 2025-12-04T08:58:32.9931071Z b29343478586: Waiting 2025-12-04T08:58:32.9931233Z 148171691cd4: Pulling fs layer 2025-12-04T08:58:32.9931419Z 2c666d30ed77: Pulling fs layer 2025-12-04T08:58:32.9931708Z 5d8d3a0a98e0: Pulling fs layer 2025-12-04T08:58:32.9931888Z 2c666d30ed77: Waiting 2025-12-04T08:58:32.9932040Z c6f0520487fb: Waiting 2025-12-04T08:58:32.9932207Z b06bafce9e81: Pulling fs layer 2025-12-04T08:58:32.9932509Z 148171691cd4: Waiting 2025-12-04T08:58:32.9932656Z 5d8d3a0a98e0: Waiting 2025-12-04T08:58:32.9932812Z b06bafce9e81: Waiting 2025-12-04T08:58:32.9933009Z 15e0d7e4590d: Pulling fs layer 2025-12-04T08:58:32.9933284Z a514bd1add31: Pulling fs layer 2025-12-04T08:58:32.9933477Z 57b84ee60002: Pulling fs layer 2025-12-04T08:58:32.9933678Z 15e0d7e4590d: Waiting 2025-12-04T08:58:32.9933935Z 57b84ee60002: Waiting 2025-12-04T08:58:32.9934106Z b8babeff6d81: Pulling fs layer 2025-12-04T08:58:32.9934397Z 83779ddf6a85: Pulling fs layer 2025-12-04T08:58:32.9934591Z 8b7620c0d736: Pulling fs layer 2025-12-04T08:58:32.9934780Z 3bcfa090e4ef: Pulling fs layer 2025-12-04T08:58:32.9934968Z eb0504ec4d92: Pulling fs layer 2025-12-04T08:58:32.9935151Z b8babeff6d81: Waiting 2025-12-04T08:58:32.9935306Z 8b7620c0d736: Waiting 2025-12-04T08:58:32.9935464Z 83779ddf6a85: Waiting 2025-12-04T08:58:32.9935638Z 3bcfa090e4ef: Waiting 2025-12-04T08:58:32.9935807Z 15d0fec09d7b: Pulling fs layer 2025-12-04T08:58:32.9935997Z eb0504ec4d92: Waiting 2025-12-04T08:58:32.9936157Z 15d0fec09d7b: Waiting 2025-12-04T08:58:32.9936320Z cca81fcc62a9: Pulling fs layer 2025-12-04T08:58:32.9936513Z b0b8f9b5c6ab: Pulling fs layer 2025-12-04T08:58:32.9936701Z 0606ca4d47a8: Pulling fs layer 2025-12-04T08:58:32.9936872Z cca81fcc62a9: Waiting 2025-12-04T08:58:32.9937029Z b0b8f9b5c6ab: Waiting 2025-12-04T08:58:32.9937196Z 2f80a4e1b3b9: Pulling fs layer 2025-12-04T08:58:32.9937369Z 0606ca4d47a8: Waiting 2025-12-04T08:58:32.9937537Z 35c916fb1bd0: Pulling fs layer 2025-12-04T08:58:32.9937726Z 2f80a4e1b3b9: Waiting 2025-12-04T08:58:32.9937887Z 195537b7dafc: Pulling fs layer 2025-12-04T08:58:32.9938078Z dc454fd3967e: Pulling fs layer 2025-12-04T08:58:32.9938265Z 701b34f115fa: Pulling fs layer 2025-12-04T08:58:32.9949385Z 39cefc00ffed: Pulling fs layer 2025-12-04T08:58:32.9949685Z 6ae51eb61a32: Pulling fs layer 2025-12-04T08:58:32.9949888Z 35c916fb1bd0: Waiting 2025-12-04T08:58:32.9950071Z dc454fd3967e: Waiting 2025-12-04T08:58:32.9950242Z 195537b7dafc: Waiting 2025-12-04T08:58:32.9950436Z 1fd5341e66df: Pulling fs layer 2025-12-04T08:58:32.9950631Z 701b34f115fa: Waiting 2025-12-04T08:58:32.9950812Z 72a7c87e35e4: Pulling fs layer 2025-12-04T08:58:32.9951009Z ec36862ac98e: Pulling fs layer 2025-12-04T08:58:32.9951203Z 05ddbf246e8a: Pulling fs layer 2025-12-04T08:58:32.9951377Z 72a7c87e35e4: Waiting 2025-12-04T08:58:32.9951534Z ec36862ac98e: Waiting 2025-12-04T08:58:32.9951696Z 39cefc00ffed: Waiting 2025-12-04T08:58:32.9951914Z 1fd5341e66df: Waiting 2025-12-04T08:58:32.9952226Z 6ae51eb61a32: Waiting 2025-12-04T08:58:33.0786766Z 0678d56345c9: Verifying Checksum 2025-12-04T08:58:33.0787091Z 0678d56345c9: Download complete 2025-12-04T08:58:33.1706541Z 086b1df51ac1: Download complete 2025-12-04T08:58:33.2393353Z fe8a7b64bf98: Verifying Checksum 2025-12-04T08:58:33.2393839Z fe8a7b64bf98: Download complete 2025-12-04T08:58:33.3250617Z 7680723e9a57: Verifying Checksum 2025-12-04T08:58:33.3250907Z 7680723e9a57: Download complete 2025-12-04T08:58:33.3380553Z 63e5bc7682b8: Verifying Checksum 2025-12-04T08:58:33.3381061Z 63e5bc7682b8: Download complete 2025-12-04T08:58:33.3992012Z 9c5027aeeb4e: Verifying Checksum 2025-12-04T08:58:33.3992510Z 9c5027aeeb4e: Download complete 2025-12-04T08:58:33.4245033Z 9a5652110360: Download complete 2025-12-04T08:58:33.4915166Z a86faaa7dbdd: Verifying Checksum 2025-12-04T08:58:33.4915525Z a86faaa7dbdd: Download complete 2025-12-04T08:58:33.5840086Z fb7848686804: Verifying Checksum 2025-12-04T08:58:33.5840663Z fb7848686804: Download complete 2025-12-04T08:58:33.6929766Z 3541df015cdb: Verifying Checksum 2025-12-04T08:58:33.6930116Z 3541df015cdb: Download complete 2025-12-04T08:58:33.7690791Z 79dc80f426b2: Verifying Checksum 2025-12-04T08:58:33.7691291Z 79dc80f426b2: Download complete 2025-12-04T08:58:34.2004048Z 63e5bc7682b8: Pull complete 2025-12-04T08:58:34.2162140Z 0678d56345c9: Pull complete 2025-12-04T08:58:34.5640776Z 375c4427e914: Verifying Checksum 2025-12-04T08:58:34.5641145Z 375c4427e914: Download complete 2025-12-04T08:58:34.5721108Z 4f4fb700ef54: Verifying Checksum 2025-12-04T08:58:34.5721653Z 4f4fb700ef54: Download complete 2025-12-04T08:58:34.6619605Z 549db4d6c618: Verifying Checksum 2025-12-04T08:58:34.6619949Z 549db4d6c618: Download complete 2025-12-04T08:58:34.7347167Z 5c63528cb580: Download complete 2025-12-04T08:58:34.8039444Z 75bd83b989a4: Verifying Checksum 2025-12-04T08:58:34.8039809Z 75bd83b989a4: Download complete 2025-12-04T08:58:34.8908409Z de6e78970f51: Verifying Checksum 2025-12-04T08:58:34.8908781Z de6e78970f51: Download complete 2025-12-04T08:58:34.9688233Z e13ed7c7e473: Verifying Checksum 2025-12-04T08:58:34.9688494Z e13ed7c7e473: Download complete 2025-12-04T08:58:35.0479291Z 6e2949bcb741: Verifying Checksum 2025-12-04T08:58:35.0479588Z 6e2949bcb741: Download complete 2025-12-04T08:58:35.1290718Z 14d69d9aaec7: Download complete 2025-12-04T08:58:35.1874861Z 5c02769dd8e5: Download complete 2025-12-04T08:58:36.1655931Z 45f5c9ddfce7: Verifying Checksum 2025-12-04T08:58:36.1656260Z 45f5c9ddfce7: Download complete 2025-12-04T08:58:36.2550569Z 2fa92dc5885e: Verifying Checksum 2025-12-04T08:58:36.2551028Z 2fa92dc5885e: Download complete 2025-12-04T08:58:36.6253834Z 2b85eafbd92a: Verifying Checksum 2025-12-04T08:58:36.6254291Z 2b85eafbd92a: Download complete 2025-12-04T08:58:36.7147017Z ff755a4ddad7: Download complete 2025-12-04T08:58:36.8475830Z 09eb41bdf42d: Download complete 2025-12-04T08:58:41.4734021Z 11ede4d59e93: Verifying Checksum 2025-12-04T08:58:41.4734480Z 11ede4d59e93: Download complete 2025-12-04T08:58:41.5575200Z 1283cd8f801a: Verifying Checksum 2025-12-04T08:58:41.5575533Z 1283cd8f801a: Download complete 2025-12-04T08:58:41.6449902Z 024fa855425f: Verifying Checksum 2025-12-04T08:58:41.6450378Z 024fa855425f: Download complete 2025-12-04T08:58:41.7196879Z 303e6747a62e: Verifying Checksum 2025-12-04T08:58:41.7197258Z 303e6747a62e: Download complete 2025-12-04T08:58:41.7773202Z 3017cdf4838b: Download complete 2025-12-04T08:58:42.0169883Z 6b6cd1c358e8: Verifying Checksum 2025-12-04T08:58:42.1098083Z b2dd04501124: Verifying Checksum 2025-12-04T08:58:42.1098503Z b2dd04501124: Download complete 2025-12-04T08:58:42.1702724Z 55adc51fe589: Download complete 2025-12-04T08:58:42.2544701Z a43ca0e4b837: Verifying Checksum 2025-12-04T08:58:42.2544988Z a43ca0e4b837: Download complete 2025-12-04T08:58:42.3317790Z b7212f17fd14: Verifying Checksum 2025-12-04T08:58:42.3318154Z b7212f17fd14: Download complete 2025-12-04T08:58:42.4115609Z 083e42cac090: Verifying Checksum 2025-12-04T08:58:42.4116014Z 083e42cac090: Download complete 2025-12-04T08:58:42.5033331Z 0a00b784a4aa: Verifying Checksum 2025-12-04T08:58:42.5033795Z 0a00b784a4aa: Download complete 2025-12-04T08:58:42.5750220Z c6173c779f7b: Verifying Checksum 2025-12-04T08:58:42.5750581Z c6173c779f7b: Download complete 2025-12-04T08:58:43.0725309Z 45f5c9ddfce7: Pull complete 2025-12-04T08:58:43.0901338Z 086b1df51ac1: Pull complete 2025-12-04T08:58:43.1056367Z fe8a7b64bf98: Pull complete 2025-12-04T08:58:43.1213800Z 7680723e9a57: Pull complete 2025-12-04T08:58:43.1389514Z 9c5027aeeb4e: Pull complete 2025-12-04T08:58:43.1545400Z 9a5652110360: Pull complete 2025-12-04T08:58:44.0863371Z ed3d1e3387b9: Verifying Checksum 2025-12-04T08:58:44.0863781Z ed3d1e3387b9: Download complete 2025-12-04T08:58:44.1624739Z b29343478586: Verifying Checksum 2025-12-04T08:58:44.1625084Z b29343478586: Download complete 2025-12-04T08:58:45.1718810Z 375c4427e914: Pull complete 2025-12-04T08:58:45.3226439Z a86faaa7dbdd: Pull complete 2025-12-04T08:58:45.4232305Z fb7848686804: Pull complete 2025-12-04T08:58:45.4943655Z 3541df015cdb: Pull complete 2025-12-04T08:58:45.5958000Z 79dc80f426b2: Pull complete 2025-12-04T08:58:47.3329859Z c6f0520487fb: Verifying Checksum 2025-12-04T08:58:47.3330205Z c6f0520487fb: Download complete 2025-12-04T08:59:05.6767374Z a13fcc1b90bb: Verifying Checksum 2025-12-04T08:59:05.6767778Z a13fcc1b90bb: Download complete 2025-12-04T08:59:05.7552370Z 2c666d30ed77: Download complete 2025-12-04T08:59:05.8284306Z 5d8d3a0a98e0: Verifying Checksum 2025-12-04T08:59:05.8284646Z 5d8d3a0a98e0: Download complete 2025-12-04T08:59:05.8952765Z b06bafce9e81: Verifying Checksum 2025-12-04T08:59:05.8953114Z b06bafce9e81: Download complete 2025-12-04T08:59:05.9700101Z 15e0d7e4590d: Verifying Checksum 2025-12-04T08:59:05.9700444Z 15e0d7e4590d: Download complete 2025-12-04T08:59:06.0633119Z a514bd1add31: Download complete 2025-12-04T08:59:06.1363186Z 57b84ee60002: Verifying Checksum 2025-12-04T08:59:06.1363541Z 57b84ee60002: Download complete 2025-12-04T08:59:06.2059132Z b8babeff6d81: Verifying Checksum 2025-12-04T08:59:06.2059476Z b8babeff6d81: Download complete 2025-12-04T08:59:06.3078640Z 83779ddf6a85: Download complete 2025-12-04T08:59:06.3713636Z 8b7620c0d736: Verifying Checksum 2025-12-04T08:59:06.3714012Z 8b7620c0d736: Download complete 2025-12-04T08:59:06.4583555Z 3bcfa090e4ef: Verifying Checksum 2025-12-04T08:59:06.4584419Z 3bcfa090e4ef: Download complete 2025-12-04T08:59:06.5428663Z eb0504ec4d92: Verifying Checksum 2025-12-04T08:59:06.5429037Z eb0504ec4d92: Download complete 2025-12-04T08:59:06.6014014Z 15d0fec09d7b: Verifying Checksum 2025-12-04T08:59:06.6014460Z 15d0fec09d7b: Download complete 2025-12-04T08:59:06.6807612Z cca81fcc62a9: Verifying Checksum 2025-12-04T08:59:06.6808072Z cca81fcc62a9: Download complete 2025-12-04T08:59:06.7529971Z b0b8f9b5c6ab: Verifying Checksum 2025-12-04T08:59:06.7530478Z b0b8f9b5c6ab: Download complete 2025-12-04T08:59:06.8302883Z 0606ca4d47a8: Verifying Checksum 2025-12-04T08:59:06.8303291Z 0606ca4d47a8: Download complete 2025-12-04T08:59:06.9244821Z 2f80a4e1b3b9: Download complete 2025-12-04T08:59:09.0354063Z 195537b7dafc: Verifying Checksum 2025-12-04T08:59:09.0354400Z 195537b7dafc: Download complete 2025-12-04T08:59:09.1102519Z dc454fd3967e: Verifying Checksum 2025-12-04T08:59:09.1102935Z dc454fd3967e: Download complete 2025-12-04T08:59:09.1833612Z 701b34f115fa: Verifying Checksum 2025-12-04T08:59:09.1835602Z 701b34f115fa: Download complete 2025-12-04T08:59:09.2384717Z 39cefc00ffed: Download complete 2025-12-04T08:59:09.3153000Z 6ae51eb61a32: Verifying Checksum 2025-12-04T08:59:09.3153345Z 6ae51eb61a32: Download complete 2025-12-04T08:59:09.3976401Z 1fd5341e66df: Verifying Checksum 2025-12-04T08:59:09.3976766Z 1fd5341e66df: Download complete 2025-12-04T08:59:09.5685316Z 72a7c87e35e4: Verifying Checksum 2025-12-04T08:59:09.5685801Z 72a7c87e35e4: Download complete 2025-12-04T08:59:09.6326354Z ec36862ac98e: Verifying Checksum 2025-12-04T08:59:09.6326756Z ec36862ac98e: Download complete 2025-12-04T08:59:10.2218102Z 05ddbf246e8a: Verifying Checksum 2025-12-04T08:59:10.2218863Z 05ddbf246e8a: Download complete 2025-12-04T08:59:17.9573101Z 148171691cd4: Verifying Checksum 2025-12-04T08:59:17.9573473Z 148171691cd4: Download complete 2025-12-04T08:59:56.2962222Z 35041ce524ac: Verifying Checksum 2025-12-04T08:59:56.2962524Z 35041ce524ac: Download complete 2025-12-04T09:00:25.8638023Z a13fcc1b90bb: Pull complete 2025-12-04T09:00:26.0868008Z 4f4fb700ef54: Pull complete 2025-12-04T09:00:26.2010266Z 549db4d6c618: Pull complete 2025-12-04T09:00:26.3826725Z 5c63528cb580: Pull complete 2025-12-04T09:00:26.5970840Z 75bd83b989a4: Pull complete 2025-12-04T09:00:26.8758012Z de6e78970f51: Pull complete 2025-12-04T09:00:27.0922558Z e13ed7c7e473: Pull complete 2025-12-04T09:00:27.3015198Z 6e2949bcb741: Pull complete 2025-12-04T09:00:27.5196066Z 14d69d9aaec7: Pull complete 2025-12-04T09:00:27.7421563Z 5c02769dd8e5: Pull complete 2025-12-04T09:02:00.9189791Z 35041ce524ac: Pull complete 2025-12-04T09:02:01.1329886Z 2fa92dc5885e: Pull complete 2025-12-04T09:02:01.7149179Z 2b85eafbd92a: Pull complete 2025-12-04T09:02:01.9423398Z ff755a4ddad7: Pull complete 2025-12-04T09:02:02.0411607Z 09eb41bdf42d: Pull complete 2025-12-04T09:02:08.4603793Z 11ede4d59e93: Pull complete 2025-12-04T09:02:08.7026597Z 1283cd8f801a: Pull complete 2025-12-04T09:02:08.9255127Z 024fa855425f: Pull complete 2025-12-04T09:02:09.3763163Z 303e6747a62e: Pull complete 2025-12-04T09:02:09.6029005Z 3017cdf4838b: Pull complete 2025-12-04T09:02:09.9567537Z 6b6cd1c358e8: Pull complete 2025-12-04T09:02:10.1802666Z b2dd04501124: Pull complete 2025-12-04T09:02:10.3980295Z 55adc51fe589: Pull complete 2025-12-04T09:02:10.8590936Z a43ca0e4b837: Pull complete 2025-12-04T09:02:11.0991868Z b7212f17fd14: Pull complete 2025-12-04T09:02:11.3226474Z 083e42cac090: Pull complete 2025-12-04T09:02:11.7670216Z 0a00b784a4aa: Pull complete 2025-12-04T09:02:11.9758828Z c6173c779f7b: Pull complete 2025-12-04T09:02:14.7339781Z ed3d1e3387b9: Pull complete 2025-12-04T09:02:14.9660121Z b29343478586: Pull complete 2025-12-04T09:02:15.9898439Z c6f0520487fb: Pull complete 2025-12-04T09:02:58.9388302Z 148171691cd4: Pull complete 2025-12-04T09:02:59.1266265Z 2c666d30ed77: Pull complete 2025-12-04T09:02:59.3134588Z 5d8d3a0a98e0: Pull complete 2025-12-04T09:02:59.5998511Z b06bafce9e81: Pull complete 2025-12-04T09:02:59.7402568Z 15e0d7e4590d: Pull complete 2025-12-04T09:02:59.7962853Z a514bd1add31: Pull complete 2025-12-04T09:03:00.1391039Z 57b84ee60002: Pull complete 2025-12-04T09:03:00.4880313Z b8babeff6d81: Pull complete 2025-12-04T09:03:00.6555929Z 83779ddf6a85: Pull complete 2025-12-04T09:03:00.9902005Z 8b7620c0d736: Pull complete 2025-12-04T09:03:01.3005147Z 3bcfa090e4ef: Pull complete 2025-12-04T09:03:01.5068701Z eb0504ec4d92: Pull complete 2025-12-04T09:03:01.8981915Z 15d0fec09d7b: Pull complete 2025-12-04T09:03:02.0932090Z cca81fcc62a9: Pull complete 2025-12-04T09:03:02.4518973Z b0b8f9b5c6ab: Pull complete 2025-12-04T09:03:02.6390194Z 0606ca4d47a8: Pull complete 2025-12-04T09:03:03.0023447Z 2f80a4e1b3b9: Pull complete 2025-12-04T09:03:03.2222674Z 35c916fb1bd0: Pull complete 2025-12-04T09:03:08.3396035Z 195537b7dafc: Pull complete 2025-12-04T09:03:08.5711646Z dc454fd3967e: Pull complete 2025-12-04T09:03:08.7980387Z 701b34f115fa: Pull complete 2025-12-04T09:03:09.0034486Z 39cefc00ffed: Pull complete 2025-12-04T09:03:09.2340081Z 6ae51eb61a32: Pull complete 2025-12-04T09:03:09.4553779Z 1fd5341e66df: Pull complete 2025-12-04T09:03:10.8101215Z 72a7c87e35e4: Pull complete 2025-12-04T09:03:11.0261526Z ec36862ac98e: Pull complete 2025-12-04T09:03:12.2729421Z 05ddbf246e8a: Pull complete 2025-12-04T09:03:12.5546034Z Digest: sha256:ba21003510dba4bdeed83df81a56fa468e0ee1b612a9445ae1f402a280804f97 2025-12-04T09:03:12.6000648Z Status: Downloaded newer image for 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T09:03:12.6265554Z 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T09:03:12.6319357Z ##[group]Run echo "IN_CONTAINER_RUNNER=$(if [ -f /.inarc ] || [ -f /.incontainer ]; then echo true ; else echo false; fi)" >> "$GITHUB_OUTPUT" 2025-12-04T09:03:12.6320090Z echo "IN_CONTAINER_RUNNER=$(if [ -f /.inarc ] || [ -f /.incontainer ]; then echo true ; else echo false; fi)" >> "$GITHUB_OUTPUT" 2025-12-04T09:03:12.6329522Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:03:12.6329797Z env: 2025-12-04T09:03:12.6329961Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:03:12.6330140Z ##[endgroup] 2025-12-04T09:03:12.6488691Z ##[group]Run pytorch/test-infra/.github/actions/setup-nvidia@main 2025-12-04T09:03:12.6489024Z with: 2025-12-04T09:03:12.6489182Z driver-version: 580.82.07 2025-12-04T09:03:12.6489366Z env: 2025-12-04T09:03:12.6489516Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:03:12.6489694Z ##[endgroup] 2025-12-04T09:03:12.6548616Z ##[group]Run echo "IN_CONTAINER_RUNNER=$(if [ -f /.inarc ] || [ -f /.incontainer ]; then echo true ; else echo false; fi)" >> "$GITHUB_OUTPUT" 2025-12-04T09:03:12.6549390Z echo "IN_CONTAINER_RUNNER=$(if [ -f /.inarc ] || [ -f /.incontainer ]; then echo true ; else echo false; fi)" >> "$GITHUB_OUTPUT" 2025-12-04T09:03:12.6556555Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:03:12.6556832Z env: 2025-12-04T09:03:12.6557204Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:03:12.6557386Z ##[endgroup] 2025-12-04T09:03:12.6645090Z ##[group]Run set -euo pipefail 2025-12-04T09:03:12.6645339Z set -euo pipefail 2025-12-04T09:03:12.6645555Z  2025-12-04T09:03:12.6645718Z has_gpu=false 2025-12-04T09:03:12.6645896Z devices="" 2025-12-04T09:03:12.6646055Z  2025-12-04T09:03:12.6646254Z if command -v nvidia-smi >/dev/null 2>&1; then 2025-12-04T09:03:12.6646596Z  if nvidia-smi -L >/tmp/nvidia_devices 2>/dev/null; then 2025-12-04T09:03:12.6646872Z  has_gpu=true 2025-12-04T09:03:12.6647080Z  devices=$(cat /tmp/nvidia_devices) 2025-12-04T09:03:12.6647297Z  fi 2025-12-04T09:03:12.6647451Z fi 2025-12-04T09:03:12.6647594Z  2025-12-04T09:03:12.6647747Z if [ "$has_gpu" = false ]; then 2025-12-04T09:03:12.6648021Z  if ls /dev/nvidia* >/tmp/nvidia_devices 2>/dev/null; then 2025-12-04T09:03:12.6648295Z  has_gpu=true 2025-12-04T09:03:12.6648502Z  devices=$(cat /tmp/nvidia_devices) 2025-12-04T09:03:12.6648720Z  fi 2025-12-04T09:03:12.6648865Z fi 2025-12-04T09:03:12.6649010Z  2025-12-04T09:03:12.6649225Z if [ "$has_gpu" = false ] && command -v lspci >/dev/null 2>&1; then 2025-12-04T09:03:12.6649580Z  if lspci | grep -i 'nvidia' >/tmp/nvidia_devices 2>/dev/null; then 2025-12-04T09:03:12.6649862Z  has_gpu=true 2025-12-04T09:03:12.6650058Z  devices=$(cat /tmp/nvidia_devices) 2025-12-04T09:03:12.6650274Z  fi 2025-12-04T09:03:12.6650418Z fi 2025-12-04T09:03:12.6650555Z  2025-12-04T09:03:12.6650771Z printf 'HAS_NVIDIA=%s\n' "$has_gpu" >> "$GITHUB_OUTPUT" 2025-12-04T09:03:12.6651144Z printf 'DETECTED_DEVICES<> "$GITHUB_OUTPUT" 2025-12-04T09:03:12.6658311Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:03:12.6658599Z env: 2025-12-04T09:03:12.6658757Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:03:12.6658943Z ##[endgroup] 2025-12-04T09:03:14.2937797Z ##[group]Run if [ "${HAS_NVIDIA}" = "true" ]; then 2025-12-04T09:03:14.2938188Z if [ "${HAS_NVIDIA}" = "true" ]; then 2025-12-04T09:03:14.2938697Z  echo "HAS_NVIDIA_GPU=true" >> "${GITHUB_ENV}" 2025-12-04T09:03:14.2939189Z  echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}" 2025-12-04T09:03:14.2939620Z else 2025-12-04T09:03:14.2939876Z  echo "HAS_NVIDIA_GPU=false" >> "${GITHUB_ENV}" 2025-12-04T09:03:14.2940203Z fi 2025-12-04T09:03:14.2949083Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:03:14.2949368Z env: 2025-12-04T09:03:14.2949535Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:03:14.2949733Z HAS_NVIDIA: true 2025-12-04T09:03:14.2961720Z ##[endgroup] 2025-12-04T09:03:14.3033481Z ##[group]Run nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482 2025-12-04T09:03:14.3033793Z with: 2025-12-04T09:03:14.3033947Z timeout_minutes: 10 2025-12-04T09:03:14.3034137Z max_attempts: 3 2025-12-04T09:03:14.3053299Z command: # Is it disgusting to have a full shell script here in this github action? Sure # But is it the best way to make it so that this action relies on nothing else? Absolutely set -eou pipefail DISTRIBUTION=$(. /etc/os-release;echo $ID$VERSION_ID) DRIVER_FN="NVIDIA-Linux-x86_64-${DRIVER_VERSION}.run" install_nvidia_docker2_amzn2() { ( set -x # Needed for yum-config-manager sudo yum install -y yum-utils if [[ "${DISTRIBUTION}" == "amzn2023" ]] ; then YUM_REPO_URL="https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo" else # Amazon Linux 2 YUM_REPO_URL="https://nvidia.github.io/nvidia-docker/${DISTRIBUTION}/nvidia-docker.repo" fi sudo yum-config-manager --add-repo "${YUM_REPO_URL}" sudo yum install -y \ nvidia-container-toolkit-1.17.8 \ libnvidia-container-tools-1.17.8 \ libnvidia-container1-1.17.8 \ nvidia-container-toolkit-base-1.17.8 sudo systemctl restart docker ) } install_nvidia_docker2_ubuntu20() { ( set -x # Install nvidia-driver package if not installed status="$(dpkg-query -W --showformat='${db:Status-Status}' nvidia-docker2 2>&1)" if [ ! $? = 0 ] || [ ! "$status" = installed ]; then sudo apt-get install -y nvidia-container-toolkit-1.17.8 sudo systemctl restart docker fi ) } pre_install_nvidia_driver_amzn2() { ( # Purge any nvidia driver installed from RHEL repo sudo yum remove -y nvidia-driver-latest-dkms ) } install_nvidia_driver_common() { ( # Try to gather more information about the runner and its existing NVIDIA driver if any echo "Before installing NVIDIA driver" lspci lsmod modinfo nvidia || true HAS_NVIDIA_DRIVER=0 # Check if NVIDIA driver has already been installed if [ -x "$(command -v nvidia-smi)" ]; then set +e # The driver exists, check its version next. Also check only the first GPU if there are more than one of them # so that the same driver version is not print over multiple lines INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0) NVIDIA_SMI_STATUS=$? if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then echo "Failed to get NVIDIA driver version ($INSTALLED_DRIVER_VERSION). Continuing" elif [ "$INSTALLED_DRIVER_VERSION" != "$DRIVER_VERSION" ]; then echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has been installed, but we expect to have $DRIVER_VERSION instead. Continuing" # Turn off persistent mode so that the installation script can unload the kernel module sudo killall nvidia-persistenced || true else HAS_NVIDIA_DRIVER=1 echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has already been installed. Skipping NVIDIA driver installation" fi set -e fi if [ "$HAS_NVIDIA_DRIVER" -eq 0 ]; then # CAUTION: this may need to be updated in future if [ "${DISTRIBUTION}" != ubuntu20.04 ]; then sudo yum groupinstall -y "Development Tools" # ensure our kernel install is the same as our underlying kernel, # groupinstall "Development Tools" has a habit of mismatching kernel headers sudo yum install -y "kernel-devel-uname-r == $(uname -r)" sudo modprobe backlight fi sudo curl -fsL -o /tmp/nvidia_driver "https://s3.amazonaws.com/ossci-linux/nvidia_driver/$DRIVER_FN" set +e sudo /bin/bash /tmp/nvidia_driver -s --no-drm NVIDIA_INSTALLATION_STATUS=$? RESET_GPU=0 if [ "$NVIDIA_INSTALLATION_STATUS" -ne 0 ]; then sudo cat /var/log/nvidia-installer.log # Fail to install NVIDIA driver, try to reset the GPU RESET_GPU=1 elif [ -x "$(command -v nvidia-smi)" ]; then # Check again if nvidia-smi works even if the driver installation completes successfully INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0) NVIDIA_SMI_STATUS=$? if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then RESET_GPU=1 fi fi if [ "$RESET_GPU" -eq 1 ]; then NVIDIA_DEVICES=$(lspci -D | grep -i NVIDIA | cut -d' ' -f1) # The GPU can get stuck in a failure state if somehow the test crashs the GPU microcode. When this # happens, we'll try to reset all NVIDIA devices https://github.com/pytorch/pytorch/issues/88388 for PCI_ID in $NVIDIA_DEVICES; do DEVICE_ENABLED=$(cat /sys/bus/pci/devices/$PCI_ID/enable) echo "Reseting $PCI_ID (enabled state: $DEVICE_ENABLED)" # This requires sudo permission of course echo "1" | sudo tee /sys/bus/pci/devices/$PCI_ID/reset sleep 1 done fi sudo rm -fv /tmp/nvidia_driver set -e fi ) } post_install_nvidia_driver_common() { ( sudo modprobe nvidia || true echo "After installing NVIDIA driver" lspci lsmod modinfo nvidia || true ( set +e nvidia-smi # NB: Annoyingly, nvidia-smi command returns successfully with return code 0 even in # the case where the driver has already crashed as it still can get the driver version # and some basic information like the bus ID. However, the rest of the information # would be missing (ERR!), for example: # # +-----------------------------------------------------------------------------+ # | NVIDIA-SMI 525.89.02 Driver Version: 525.89.02 CUDA Version: 12.0 | # |-------------------------------+----------------------+----------------------+ # | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | # | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | # | | | MIG M. | # |===============================+======================+======================| # | 0 ERR! Off | 00000000:00:1E.0 Off | ERR! | # |ERR! ERR! ERR! ERR! / ERR! | 4184MiB / 23028MiB | ERR! Default | # | | | ERR! | # +-------------------------------+----------------------+----------------------+ # # +-----------------------------------------------------------------------------+ # | Processes: | # | GPU GI CI PID Type Process name GPU Memory | # | ID ID Usage | # |=============================================================================| # +-----------------------------------------------------------------------------+ # # This should be reported as a failure instead as it will guarantee to fail when # Docker tries to run with --gpus all # # So, the correct check here is to query one of the missing piece of info like # GPU name, so that the command can fail accordingly nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0 NVIDIA_SMI_STATUS=$? # Allowable exit statuses for nvidia-smi, see: https://github.com/NVIDIA/gpu-operator/issues/285 if [ "$NVIDIA_SMI_STATUS" -eq 0 ] || [ "$NVIDIA_SMI_STATUS" -eq 14 ]; then echo "INFO: Ignoring allowed status ${NVIDIA_SMI_STATUS}" else echo "ERROR: nvidia-smi exited with unresolved status ${NVIDIA_SMI_STATUS}" exit ${NVIDIA_SMI_STATUS} fi set -e ) ) } install_nvidia_driver_amzn2() { ( set -x pre_install_nvidia_driver_amzn2 install_nvidia_driver_common post_install_nvidia_driver_common ) } install_nvidia_driver_ubuntu20() { ( set -x install_nvidia_driver_common post_install_nvidia_driver_common ) } echo "== Installing nvidia driver ${DRIVER_FN} ==" case "${DISTRIBUTION}" in amzn*) install_nvidia_driver_amzn2 ;; ubuntu20.04) install_nvidia_driver_ubuntu20 ;; *) echo "ERROR: Unknown distribution ${DISTRIBUTION}" exit 1 ;; esac # Install container toolkit based on distribution echo "== Installing nvidia container toolkit for ${DISTRIBUTION} ==" case "${DISTRIBUTION}" in amzn*) install_nvidia_docker2_amzn2 ;; ubuntu20.04) install_nvidia_docker2_ubuntu20 ;; *) echo "ERROR: Unknown distribution ${DISTRIBUTION}" exit 1 ;; esac # Fix https://github.com/NVIDIA/nvidia-docker/issues/1648 on runners with # more than one GPUs. This just needs to be run once. The command fails # on subsequent runs and complains that the mode is already on, but that's # ok sudo nvidia-persistenced || true # This should show persistence mode ON nvidia-smi # check if the container-toolkit is correctly installed and CUDA is available inside a container docker run --rm -t --gpus=all public.ecr.aws/docker/library/python:3.13 nvidia-smi 2025-12-04T09:03:14.3072658Z retry_wait_seconds: 10 2025-12-04T09:03:14.3072864Z polling_interval_seconds: 1 2025-12-04T09:03:14.3073056Z warning_on_retry: true 2025-12-04T09:03:14.3073247Z continue_on_error: false 2025-12-04T09:03:14.3073426Z env: 2025-12-04T09:03:14.3073568Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:03:14.3073751Z HAS_NVIDIA_GPU: true 2025-12-04T09:03:14.3073983Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:03:14.3074250Z DRIVER_VERSION: 580.82.07 2025-12-04T09:03:14.3074436Z ##[endgroup] 2025-12-04T09:03:14.3736072Z == Installing nvidia driver NVIDIA-Linux-x86_64-580.82.07.run == 2025-12-04T09:03:14.3739340Z + pre_install_nvidia_driver_amzn2 2025-12-04T09:03:14.3742396Z + sudo yum remove -y nvidia-driver-latest-dkms 2025-12-04T09:03:14.9074091Z No match for argument: nvidia-driver-latest-dkms 2025-12-04T09:03:14.9074959Z No packages marked for removal. 2025-12-04T09:03:14.9128142Z Dependencies resolved. 2025-12-04T09:03:14.9137306Z Nothing to do. 2025-12-04T09:03:14.9138014Z Complete! 2025-12-04T09:03:15.0475885Z + install_nvidia_driver_common 2025-12-04T09:03:15.0478688Z + echo 'Before installing NVIDIA driver' 2025-12-04T09:03:15.0480316Z Before installing NVIDIA driver 2025-12-04T09:03:15.0482252Z + lspci 2025-12-04T09:03:15.1335833Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] 2025-12-04T09:03:15.1336370Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] 2025-12-04T09:03:15.1336935Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08) 2025-12-04T09:03:15.1337455Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111 2025-12-04T09:03:15.1337934Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller 2025-12-04T09:03:15.1338374Z 01:00.0 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:15.1338722Z 02:00.0 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:15.1339073Z 03:00.0 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:15.1339434Z 03:00.1 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:15.1339756Z 03:00.2 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:15.1340347Z 03:00.3 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:15.1340699Z 03:00.4 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:15.1341023Z 03:00.5 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:15.1341340Z 03:00.6 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:15.1341659Z 03:00.7 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:15.1341977Z 03:01.0 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:15.1342288Z 03:01.1 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:15.1342612Z 03:01.2 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:15.1342942Z 03:01.3 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:15.1343256Z 03:01.4 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:15.1343574Z 03:01.5 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:15.1344190Z 03:01.6 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:15.1344479Z 03:01.7 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:15.1344758Z 03:02.0 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:15.1345045Z 03:02.1 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:15.1345322Z 03:02.2 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:15.1345597Z 03:02.3 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:15.1345871Z 03:02.4 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:15.1346148Z 03:02.5 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:15.1346418Z 03:02.6 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:15.1346707Z 03:02.7 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:15.1346982Z 03:03.0 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:15.1347258Z 03:03.1 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:15.1347528Z 03:03.2 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:15.1347809Z 03:03.3 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:15.1348084Z 03:03.4 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:15.1348368Z 03:03.5 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:15.1348648Z 03:03.6 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:15.1348927Z 03:03.7 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:15.1349204Z 24:00.0 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:15.1349474Z 25:00.0 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:15.1349750Z 26:00.0 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:15.1350025Z 26:00.1 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:15.1350296Z 26:00.2 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:15.1350575Z 26:00.3 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:15.1350853Z 26:00.4 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:15.1351124Z 26:00.5 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:15.1351407Z 26:00.6 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:15.1351703Z 26:00.7 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:15.1351981Z 26:01.0 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:15.1352350Z 27:00.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-12-04T09:03:15.1352722Z 30:00.0 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:15.1353000Z 31:00.0 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:15.1353274Z 32:00.0 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:15.1353638Z 33:00.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller 2025-12-04T09:03:15.1354029Z 34:00.0 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:15.1354363Z 35:00.0 3D controller: NVIDIA Corporation AD104GL [L4] (rev a1) 2025-12-04T09:03:15.1354652Z + lsmod 2025-12-04T09:03:15.1381577Z Module Size Used by 2025-12-04T09:03:15.1381896Z nvidia_uvm 1925120 0 2025-12-04T09:03:15.1382158Z nvidia 14286848 1 nvidia_uvm 2025-12-04T09:03:15.1382429Z drm 602112 1 nvidia 2025-12-04T09:03:15.1382711Z drm_panel_orientation_quirks 32768 1 drm 2025-12-04T09:03:15.1383232Z backlight 24576 1 drm 2025-12-04T09:03:15.1383514Z i2c_core 110592 2 nvidia,drm 2025-12-04T09:03:15.1383786Z xt_conntrack 16384 1 2025-12-04T09:03:15.1384024Z nft_chain_nat 16384 3 2025-12-04T09:03:15.1384249Z xt_MASQUERADE 20480 1 2025-12-04T09:03:15.1384610Z nf_nat 57344 2 nft_chain_nat,xt_MASQUERADE 2025-12-04T09:03:15.1385181Z nf_conntrack_netlink 57344 0 2025-12-04T09:03:15.1385864Z nf_conntrack 184320 4 xt_conntrack,nf_nat,nf_conntrack_netlink,xt_MASQUERADE 2025-12-04T09:03:15.1386637Z nf_defrag_ipv6 24576 1 nf_conntrack 2025-12-04T09:03:15.1387171Z nf_defrag_ipv4 16384 1 nf_conntrack 2025-12-04T09:03:15.1387659Z xfrm_user 57344 1 2025-12-04T09:03:15.1388446Z xfrm_algo 16384 1 xfrm_user 2025-12-04T09:03:15.1388958Z xt_addrtype 16384 2 2025-12-04T09:03:15.1389419Z nft_compat 20480 4 2025-12-04T09:03:15.1389959Z nf_tables 311296 57 nft_compat,nft_chain_nat 2025-12-04T09:03:15.1390691Z nfnetlink 20480 4 nft_compat,nf_conntrack_netlink,nf_tables 2025-12-04T09:03:15.1391346Z br_netfilter 36864 0 2025-12-04T09:03:15.1391815Z bridge 323584 1 br_netfilter 2025-12-04T09:03:15.1392333Z stp 16384 1 bridge 2025-12-04T09:03:15.1392825Z llc 16384 2 bridge,stp 2025-12-04T09:03:15.1393302Z overlay 167936 0 2025-12-04T09:03:15.1393722Z tls 139264 0 2025-12-04T09:03:15.1394137Z nls_ascii 16384 1 2025-12-04T09:03:15.1394547Z nls_cp437 20480 1 2025-12-04T09:03:15.1394951Z vfat 24576 1 2025-12-04T09:03:15.1395200Z fat 86016 1 vfat 2025-12-04T09:03:15.1395398Z sunrpc 700416 1 2025-12-04T09:03:15.1395579Z ena 184320 0 2025-12-04T09:03:15.1395767Z i8042 45056 0 2025-12-04T09:03:15.1395963Z serio 28672 3 i8042 2025-12-04T09:03:15.1396167Z ghash_clmulni_intel 16384 0 2025-12-04T09:03:15.1396364Z button 24576 0 2025-12-04T09:03:15.1396544Z sch_fq_codel 20480 9 2025-12-04T09:03:15.1396720Z dm_mod 188416 0 2025-12-04T09:03:15.1396899Z fuse 184320 1 2025-12-04T09:03:15.1397078Z loop 36864 0 2025-12-04T09:03:15.1397256Z configfs 57344 1 2025-12-04T09:03:15.1397437Z dmi_sysfs 20480 0 2025-12-04T09:03:15.1397619Z crc32_pclmul 16384 0 2025-12-04T09:03:15.1397796Z crc32c_intel 24576 0 2025-12-04T09:03:15.1397973Z efivarfs 24576 1 2025-12-04T09:03:15.1398156Z + modinfo nvidia 2025-12-04T09:03:15.1406330Z filename: /lib/modules/6.1.150-174.273.amzn2023.x86_64/kernel/drivers/video/nvidia.ko 2025-12-04T09:03:15.1406760Z import_ns: DMA_BUF 2025-12-04T09:03:15.1406993Z alias: char-major-195-* 2025-12-04T09:03:15.1407235Z version: 580.82.07 2025-12-04T09:03:15.1407475Z supported: external 2025-12-04T09:03:15.1407703Z license: Dual MIT/GPL 2025-12-04T09:03:15.1407965Z firmware: nvidia/580.82.07/gsp_tu10x.bin 2025-12-04T09:03:15.1408284Z firmware: nvidia/580.82.07/gsp_ga10x.bin 2025-12-04T09:03:15.1408569Z srcversion: BA7240A71DCF7DC6FE88C1D 2025-12-04T09:03:15.1408876Z alias: of:N*T*Cnvidia,tegra264-displayC* 2025-12-04T09:03:15.1409207Z alias: of:N*T*Cnvidia,tegra264-display 2025-12-04T09:03:15.1409517Z alias: of:N*T*Cnvidia,tegra234-displayC* 2025-12-04T09:03:15.1409832Z alias: of:N*T*Cnvidia,tegra234-display 2025-12-04T09:03:15.1410141Z alias: pci:v000010DEd*sv*sd*bc06sc80i00* 2025-12-04T09:03:15.1410437Z alias: pci:v000010DEd*sv*sd*bc03sc02i00* 2025-12-04T09:03:15.1410733Z alias: pci:v000010DEd*sv*sd*bc03sc00i00* 2025-12-04T09:03:15.1411013Z depends: i2c-core,drm 2025-12-04T09:03:15.1411241Z retpoline: Y 2025-12-04T09:03:15.1411580Z name: nvidia 2025-12-04T09:03:15.1411926Z vermagic: 6.1.150-174.273.amzn2023.x86_64 SMP preempt mod_unload modversions 2025-12-04T09:03:15.1412513Z parm: NvSwitchRegDwords:NvSwitch regkey (charp) 2025-12-04T09:03:15.1412931Z parm: NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp) 2025-12-04T09:03:15.1413319Z parm: NVreg_ResmanDebugLevel:int 2025-12-04T09:03:15.1413608Z parm: NVreg_RmLogonRC:int 2025-12-04T09:03:15.1413887Z parm: NVreg_ModifyDeviceFiles:int 2025-12-04T09:03:15.1414177Z parm: NVreg_DeviceFileUID:int 2025-12-04T09:03:15.1414449Z parm: NVreg_DeviceFileGID:int 2025-12-04T09:03:15.1414692Z parm: NVreg_DeviceFileMode:int 2025-12-04T09:03:15.1414962Z parm: NVreg_InitializeSystemMemoryAllocations:int 2025-12-04T09:03:15.1415360Z parm: NVreg_UsePageAttributeTable:int 2025-12-04T09:03:15.1415613Z parm: NVreg_EnablePCIeGen3:int 2025-12-04T09:03:15.1415837Z parm: NVreg_EnableMSI:int 2025-12-04T09:03:15.1416075Z parm: NVreg_EnableStreamMemOPs:int 2025-12-04T09:03:15.1416347Z parm: NVreg_RestrictProfilingToAdminUsers:int 2025-12-04T09:03:15.1416638Z parm: NVreg_PreserveVideoMemoryAllocations:int 2025-12-04T09:03:15.1416921Z parm: NVreg_EnableS0ixPowerManagement:int 2025-12-04T09:03:15.1417232Z parm: NVreg_S0ixPowerManagementVideoMemoryThreshold:int 2025-12-04T09:03:15.1417532Z parm: NVreg_DynamicPowerManagement:int 2025-12-04T09:03:15.1417845Z parm: NVreg_DynamicPowerManagementVideoMemoryThreshold:int 2025-12-04T09:03:15.1418147Z parm: NVreg_EnableGpuFirmware:int 2025-12-04T09:03:15.1418401Z parm: NVreg_EnableGpuFirmwareLogs:int 2025-12-04T09:03:15.1418664Z parm: NVreg_OpenRmEnableUnsupportedGpus:int 2025-12-04T09:03:15.1418953Z parm: NVreg_EnableUserNUMAManagement:int 2025-12-04T09:03:15.1419209Z parm: NVreg_MemoryPoolSize:int 2025-12-04T09:03:15.1419443Z parm: NVreg_KMallocHeapMaxSize:int 2025-12-04T09:03:15.1419687Z parm: NVreg_VMallocHeapMaxSize:int 2025-12-04T09:03:15.1419928Z parm: NVreg_IgnoreMMIOCheck:int 2025-12-04T09:03:15.1420154Z parm: NVreg_NvLinkDisable:int 2025-12-04T09:03:15.1420411Z parm: NVreg_EnablePCIERelaxedOrderingMode:int 2025-12-04T09:03:15.1420675Z parm: NVreg_RegisterPCIDriver:int 2025-12-04T09:03:15.1420930Z parm: NVreg_RegisterPlatformDeviceDriver:int 2025-12-04T09:03:15.1421187Z parm: NVreg_EnableResizableBar:int 2025-12-04T09:03:15.1421430Z parm: NVreg_EnableDbgBreakpoint:int 2025-12-04T09:03:15.1421680Z parm: NVreg_EnableNonblockingOpen:int 2025-12-04T09:03:15.1421934Z parm: NVreg_CoherentGPUMemoryMode:charp 2025-12-04T09:03:15.1422201Z parm: NVreg_RegistryDwords:charp 2025-12-04T09:03:15.1422449Z parm: NVreg_RegistryDwordsPerDevice:charp 2025-12-04T09:03:15.1422687Z parm: NVreg_RmMsg:charp 2025-12-04T09:03:15.1422905Z parm: NVreg_GpuBlacklist:charp 2025-12-04T09:03:15.1423142Z parm: NVreg_TemporaryFilePath:charp 2025-12-04T09:03:15.1423375Z parm: NVreg_ExcludedGpus:charp 2025-12-04T09:03:15.1423604Z parm: NVreg_DmaRemapPeerMmio:int 2025-12-04T09:03:15.1423842Z parm: NVreg_RmNvlinkBandwidth:charp 2025-12-04T09:03:15.1424112Z parm: NVreg_RmNvlinkBandwidthLinkCount:int 2025-12-04T09:03:15.1424362Z parm: NVreg_ImexChannelCount:int 2025-12-04T09:03:15.1424597Z parm: NVreg_CreateImexChannel0:int 2025-12-04T09:03:15.1424848Z parm: NVreg_GrdmaPciTopoCheckOverride:int 2025-12-04T09:03:15.1425098Z parm: rm_firmware_active:charp 2025-12-04T09:03:15.1425308Z + HAS_NVIDIA_DRIVER=0 2025-12-04T09:03:15.1425518Z ++ command -v nvidia-smi 2025-12-04T09:03:15.1425708Z + '[' -x /usr/bin/nvidia-smi ']' 2025-12-04T09:03:15.1425909Z + set +e 2025-12-04T09:03:15.1426243Z ++ nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0 2025-12-04T09:03:16.7621229Z + INSTALLED_DRIVER_VERSION=580.82.07 2025-12-04T09:03:16.7621574Z + NVIDIA_SMI_STATUS=0 2025-12-04T09:03:16.7622336Z + '[' 0 -ne 0 ']' 2025-12-04T09:03:16.7622538Z + '[' 580.82.07 '!=' 580.82.07 ']' 2025-12-04T09:03:16.7622784Z + HAS_NVIDIA_DRIVER=1 2025-12-04T09:03:16.7623189Z + echo 'NVIDIA driver (580.82.07) has already been installed. Skipping NVIDIA driver installation' 2025-12-04T09:03:16.7623627Z + set -e 2025-12-04T09:03:16.7623800Z + '[' 1 -eq 0 ']' 2025-12-04T09:03:16.7624155Z NVIDIA driver (580.82.07) has already been installed. Skipping NVIDIA driver installation 2025-12-04T09:03:16.7624578Z + post_install_nvidia_driver_common 2025-12-04T09:03:16.7627952Z + sudo modprobe nvidia 2025-12-04T09:03:16.8558767Z + echo 'After installing NVIDIA driver' 2025-12-04T09:03:16.8559422Z + lspci 2025-12-04T09:03:16.8559657Z After installing NVIDIA driver 2025-12-04T09:03:16.8730221Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] 2025-12-04T09:03:16.8730794Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] 2025-12-04T09:03:16.8731399Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08) 2025-12-04T09:03:16.8731967Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111 2025-12-04T09:03:16.8732592Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller 2025-12-04T09:03:16.8733072Z 01:00.0 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:16.8733457Z 02:00.0 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:16.8733820Z 03:00.0 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:16.8734166Z 03:00.1 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:16.8734519Z 03:00.2 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:16.8734899Z 03:00.3 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:16.8735267Z 03:00.4 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:16.8735566Z 03:00.5 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:16.8736018Z 03:00.6 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:16.8736481Z 03:00.7 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:16.8736881Z 03:01.0 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:16.8737182Z 03:01.1 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:16.8737486Z 03:01.2 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:16.8737779Z 03:01.3 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:16.8738076Z 03:01.4 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:16.8738374Z 03:01.5 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:16.8738664Z 03:01.6 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:16.8738965Z 03:01.7 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:16.8739288Z 03:02.0 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:16.8739585Z 03:02.1 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:16.8740128Z 03:02.2 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:16.8740436Z 03:02.3 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:16.8740729Z 03:02.4 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:16.8741026Z 03:02.5 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:16.8741327Z 03:02.6 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:16.8741624Z 03:02.7 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:16.8741918Z 03:03.0 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:16.8742215Z 03:03.1 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:16.8742514Z 03:03.2 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:16.8742802Z 03:03.3 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:16.8743111Z 03:03.4 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:16.8743413Z 03:03.5 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:16.8743703Z 03:03.6 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:16.8744001Z 03:03.7 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:16.8744536Z 24:00.0 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:16.8744857Z 25:00.0 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:16.8745154Z 26:00.0 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:16.8745453Z 26:00.1 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:16.8745749Z 26:00.2 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:16.8746040Z 26:00.3 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:16.8746295Z 26:00.4 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:16.8746538Z 26:00.5 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:16.8746772Z 26:00.6 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:16.8747012Z 26:00.7 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:16.8747251Z 26:01.0 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:16.8747728Z 27:00.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-12-04T09:03:16.8748062Z 30:00.0 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:16.8748322Z 31:00.0 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:16.8748567Z 32:00.0 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:16.8748881Z 33:00.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller 2025-12-04T09:03:16.8749204Z 34:00.0 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:03:16.8749487Z 35:00.0 3D controller: NVIDIA Corporation AD104GL [L4] (rev a1) 2025-12-04T09:03:16.8749736Z + lsmod 2025-12-04T09:03:16.8762728Z Module Size Used by 2025-12-04T09:03:16.8763052Z nvidia_uvm 1925120 0 2025-12-04T09:03:16.8763313Z nvidia 14286848 1 nvidia_uvm 2025-12-04T09:03:16.8763587Z drm 602112 1 nvidia 2025-12-04T09:03:16.8763872Z drm_panel_orientation_quirks 32768 1 drm 2025-12-04T09:03:16.8764176Z backlight 24576 1 drm 2025-12-04T09:03:16.8764436Z i2c_core 110592 2 nvidia,drm 2025-12-04T09:03:16.8764698Z xt_conntrack 16384 1 2025-12-04T09:03:16.8764939Z nft_chain_nat 16384 3 2025-12-04T09:03:16.8765175Z xt_MASQUERADE 20480 1 2025-12-04T09:03:16.8765438Z nf_nat 57344 2 nft_chain_nat,xt_MASQUERADE 2025-12-04T09:03:16.8765749Z nf_conntrack_netlink 57344 0 2025-12-04T09:03:16.8766117Z nf_conntrack 184320 4 xt_conntrack,nf_nat,nf_conntrack_netlink,xt_MASQUERADE 2025-12-04T09:03:16.8766518Z nf_defrag_ipv6 24576 1 nf_conntrack 2025-12-04T09:03:16.8766816Z nf_defrag_ipv4 16384 1 nf_conntrack 2025-12-04T09:03:16.8767083Z xfrm_user 57344 1 2025-12-04T09:03:16.8767317Z xfrm_algo 16384 1 xfrm_user 2025-12-04T09:03:16.8767580Z xt_addrtype 16384 2 2025-12-04T09:03:16.8767807Z nft_compat 20480 4 2025-12-04T09:03:16.8768103Z nf_tables 311296 57 nft_compat,nft_chain_nat 2025-12-04T09:03:16.8768486Z nfnetlink 20480 4 nft_compat,nf_conntrack_netlink,nf_tables 2025-12-04T09:03:16.8768844Z br_netfilter 36864 0 2025-12-04T09:03:16.8769100Z bridge 323584 1 br_netfilter 2025-12-04T09:03:16.8769368Z stp 16384 1 bridge 2025-12-04T09:03:16.8769626Z llc 16384 2 bridge,stp 2025-12-04T09:03:16.8769883Z overlay 167936 0 2025-12-04T09:03:16.8770097Z tls 139264 0 2025-12-04T09:03:16.8770313Z nls_ascii 16384 1 2025-12-04T09:03:16.8770529Z nls_cp437 20480 1 2025-12-04T09:03:16.8770737Z vfat 24576 1 2025-12-04T09:03:16.8770961Z fat 86016 1 vfat 2025-12-04T09:03:16.8771202Z sunrpc 700416 1 2025-12-04T09:03:16.8771410Z ena 184320 0 2025-12-04T09:03:16.8771618Z i8042 45056 0 2025-12-04T09:03:16.8771837Z serio 28672 3 i8042 2025-12-04T09:03:16.8772087Z ghash_clmulni_intel 16384 0 2025-12-04T09:03:16.8772447Z button 24576 0 2025-12-04T09:03:16.8772668Z sch_fq_codel 20480 9 2025-12-04T09:03:16.8773083Z dm_mod 188416 0 2025-12-04T09:03:16.8773309Z fuse 184320 1 2025-12-04T09:03:16.8773528Z loop 36864 0 2025-12-04T09:03:16.8773747Z configfs 57344 1 2025-12-04T09:03:16.8773957Z dmi_sysfs 20480 0 2025-12-04T09:03:16.8774197Z crc32_pclmul 16384 0 2025-12-04T09:03:16.8774421Z crc32c_intel 24576 0 2025-12-04T09:03:16.8774631Z efivarfs 24576 1 2025-12-04T09:03:16.8774814Z + modinfo nvidia 2025-12-04T09:03:16.8778830Z filename: /lib/modules/6.1.150-174.273.amzn2023.x86_64/kernel/drivers/video/nvidia.ko 2025-12-04T09:03:16.8779216Z import_ns: DMA_BUF 2025-12-04T09:03:16.8779411Z alias: char-major-195-* 2025-12-04T09:03:16.8779606Z version: 580.82.07 2025-12-04T09:03:16.8779961Z supported: external 2025-12-04T09:03:16.8780155Z license: Dual MIT/GPL 2025-12-04T09:03:16.8780368Z firmware: nvidia/580.82.07/gsp_tu10x.bin 2025-12-04T09:03:16.8780630Z firmware: nvidia/580.82.07/gsp_ga10x.bin 2025-12-04T09:03:16.8780866Z srcversion: BA7240A71DCF7DC6FE88C1D 2025-12-04T09:03:16.8781110Z alias: of:N*T*Cnvidia,tegra264-displayC* 2025-12-04T09:03:16.8781376Z alias: of:N*T*Cnvidia,tegra264-display 2025-12-04T09:03:16.8781638Z alias: of:N*T*Cnvidia,tegra234-displayC* 2025-12-04T09:03:16.8781894Z alias: of:N*T*Cnvidia,tegra234-display 2025-12-04T09:03:16.8782168Z alias: pci:v000010DEd*sv*sd*bc06sc80i00* 2025-12-04T09:03:16.8782422Z alias: pci:v000010DEd*sv*sd*bc03sc02i00* 2025-12-04T09:03:16.8782673Z alias: pci:v000010DEd*sv*sd*bc03sc00i00* 2025-12-04T09:03:16.8782896Z depends: i2c-core,drm 2025-12-04T09:03:16.8783091Z retpoline: Y 2025-12-04T09:03:16.8783262Z name: nvidia 2025-12-04T09:03:16.8783527Z vermagic: 6.1.150-174.273.amzn2023.x86_64 SMP preempt mod_unload modversions 2025-12-04T09:03:16.8783882Z parm: NvSwitchRegDwords:NvSwitch regkey (charp) 2025-12-04T09:03:16.8784212Z parm: NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp) 2025-12-04T09:03:16.8784513Z parm: NVreg_ResmanDebugLevel:int 2025-12-04T09:03:16.8784740Z parm: NVreg_RmLogonRC:int 2025-12-04T09:03:16.8784962Z parm: NVreg_ModifyDeviceFiles:int 2025-12-04T09:03:16.8785196Z parm: NVreg_DeviceFileUID:int 2025-12-04T09:03:16.8785404Z parm: NVreg_DeviceFileGID:int 2025-12-04T09:03:16.8785631Z parm: NVreg_DeviceFileMode:int 2025-12-04T09:03:16.8785894Z parm: NVreg_InitializeSystemMemoryAllocations:int 2025-12-04T09:03:16.8786176Z parm: NVreg_UsePageAttributeTable:int 2025-12-04T09:03:16.8786418Z parm: NVreg_EnablePCIeGen3:int 2025-12-04T09:03:16.8786636Z parm: NVreg_EnableMSI:int 2025-12-04T09:03:16.8786860Z parm: NVreg_EnableStreamMemOPs:int 2025-12-04T09:03:16.8787137Z parm: NVreg_RestrictProfilingToAdminUsers:int 2025-12-04T09:03:16.8787429Z parm: NVreg_PreserveVideoMemoryAllocations:int 2025-12-04T09:03:16.8787699Z parm: NVreg_EnableS0ixPowerManagement:int 2025-12-04T09:03:16.8788002Z parm: NVreg_S0ixPowerManagementVideoMemoryThreshold:int 2025-12-04T09:03:16.8788302Z parm: NVreg_DynamicPowerManagement:int 2025-12-04T09:03:16.8788606Z parm: NVreg_DynamicPowerManagementVideoMemoryThreshold:int 2025-12-04T09:03:16.8788900Z parm: NVreg_EnableGpuFirmware:int 2025-12-04T09:03:16.8789145Z parm: NVreg_EnableGpuFirmwareLogs:int 2025-12-04T09:03:16.8789419Z parm: NVreg_OpenRmEnableUnsupportedGpus:int 2025-12-04T09:03:16.8789698Z parm: NVreg_EnableUserNUMAManagement:int 2025-12-04T09:03:16.8789949Z parm: NVreg_MemoryPoolSize:int 2025-12-04T09:03:16.8790193Z parm: NVreg_KMallocHeapMaxSize:int 2025-12-04T09:03:16.8790444Z parm: NVreg_VMallocHeapMaxSize:int 2025-12-04T09:03:16.8790685Z parm: NVreg_IgnoreMMIOCheck:int 2025-12-04T09:03:16.8791008Z parm: NVreg_NvLinkDisable:int 2025-12-04T09:03:16.8791269Z parm: NVreg_EnablePCIERelaxedOrderingMode:int 2025-12-04T09:03:16.8791530Z parm: NVreg_RegisterPCIDriver:int 2025-12-04T09:03:16.8791801Z parm: NVreg_RegisterPlatformDeviceDriver:int 2025-12-04T09:03:16.8792069Z parm: NVreg_EnableResizableBar:int 2025-12-04T09:03:16.8792304Z parm: NVreg_EnableDbgBreakpoint:int 2025-12-04T09:03:16.8792560Z parm: NVreg_EnableNonblockingOpen:int 2025-12-04T09:03:16.8792822Z parm: NVreg_CoherentGPUMemoryMode:charp 2025-12-04T09:03:16.8793066Z parm: NVreg_RegistryDwords:charp 2025-12-04T09:03:16.8793315Z parm: NVreg_RegistryDwordsPerDevice:charp 2025-12-04T09:03:16.8793558Z parm: NVreg_RmMsg:charp 2025-12-04T09:03:16.8793854Z parm: NVreg_GpuBlacklist:charp 2025-12-04T09:03:16.8794084Z parm: NVreg_TemporaryFilePath:charp 2025-12-04T09:03:16.8794327Z parm: NVreg_ExcludedGpus:charp 2025-12-04T09:03:16.8794561Z parm: NVreg_DmaRemapPeerMmio:int 2025-12-04T09:03:16.8794792Z parm: NVreg_RmNvlinkBandwidth:charp 2025-12-04T09:03:16.8795060Z parm: NVreg_RmNvlinkBandwidthLinkCount:int 2025-12-04T09:03:16.8795329Z parm: NVreg_ImexChannelCount:int 2025-12-04T09:03:16.8795564Z parm: NVreg_CreateImexChannel0:int 2025-12-04T09:03:16.8795830Z parm: NVreg_GrdmaPciTopoCheckOverride:int 2025-12-04T09:03:16.8796074Z parm: rm_firmware_active:charp 2025-12-04T09:03:16.8796282Z + set +e 2025-12-04T09:03:16.8796419Z + nvidia-smi 2025-12-04T09:03:18.3192185Z Thu Dec 4 09:03:18 2025 2025-12-04T09:03:18.3192923Z +-----------------------------------------------------------------------------------------+ 2025-12-04T09:03:18.3193882Z | NVIDIA-SMI 580.82.07 Driver Version: 580.82.07 CUDA Version: 13.0 | 2025-12-04T09:03:18.3195014Z +-----------------------------------------+------------------------+----------------------+ 2025-12-04T09:03:18.3196094Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2025-12-04T09:03:18.3196628Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2025-12-04T09:03:18.3197016Z | | | MIG M. | 2025-12-04T09:03:18.3197309Z |=========================================+========================+======================| 2025-12-04T09:03:18.3261323Z | 0 NVIDIA L4 Off | 00000000:35:00.0 Off | 0 | 2025-12-04T09:03:18.3261980Z | N/A 39C P0 30W / 72W | 0MiB / 23034MiB | 4% Default | 2025-12-04T09:03:18.3262384Z | | | N/A | 2025-12-04T09:03:18.3262809Z +-----------------------------------------+------------------------+----------------------+ 2025-12-04T09:03:18.3263116Z 2025-12-04T09:03:18.3263304Z +-----------------------------------------------------------------------------------------+ 2025-12-04T09:03:18.3263750Z | Processes: | 2025-12-04T09:03:18.3264214Z | GPU GI CI PID Type Process name GPU Memory | 2025-12-04T09:03:18.3264654Z | ID ID Usage | 2025-12-04T09:03:18.3265029Z |=========================================================================================| 2025-12-04T09:03:18.3266281Z | No running processes found | 2025-12-04T09:03:18.3266707Z +-----------------------------------------------------------------------------------------+ 2025-12-04T09:03:18.6503416Z + nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0 2025-12-04T09:03:20.0903510Z NVIDIA L4 2025-12-04T09:03:20.2698629Z + NVIDIA_SMI_STATUS=0 2025-12-04T09:03:20.2699284Z + '[' 0 -eq 0 ']' 2025-12-04T09:03:20.2699539Z + echo 'INFO: Ignoring allowed status 0' 2025-12-04T09:03:20.2699816Z + set -e 2025-12-04T09:03:20.2700203Z INFO: Ignoring allowed status 0 2025-12-04T09:03:20.2707613Z == Installing nvidia container toolkit for amzn2023 == 2025-12-04T09:03:20.2711371Z + sudo yum install -y yum-utils 2025-12-04T09:03:20.6963635Z Last metadata expiration check: 0:07:56 ago on Thu Dec 4 08:55:24 2025. 2025-12-04T09:03:20.7188721Z Package dnf-utils-4.3.0-13.amzn2023.0.5.noarch is already installed. 2025-12-04T09:03:20.7612824Z Dependencies resolved. 2025-12-04T09:03:20.7839361Z Nothing to do. 2025-12-04T09:03:20.7839611Z Complete! 2025-12-04T09:03:21.0626348Z + [[ amzn2023 == \a\m\z\n\2\0\2\3 ]] 2025-12-04T09:03:21.0626941Z + YUM_REPO_URL=https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo 2025-12-04T09:03:21.0628154Z + sudo yum-config-manager --add-repo https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo 2025-12-04T09:03:21.3490832Z Adding repo from: https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo 2025-12-04T09:03:21.3954232Z + sudo yum install -y nvidia-container-toolkit-1.17.8 libnvidia-container-tools-1.17.8 libnvidia-container1-1.17.8 nvidia-container-toolkit-base-1.17.8 2025-12-04T09:03:21.8813892Z nvidia-container-toolkit 23 kB/s | 833 B 00:00 2025-12-04T09:03:21.9497660Z Dependencies resolved. 2025-12-04T09:03:21.9725297Z ================================================================================ 2025-12-04T09:03:21.9725741Z Package Arch Version Repository Size 2025-12-04T09:03:21.9726163Z ================================================================================ 2025-12-04T09:03:21.9726578Z Downgrading: 2025-12-04T09:03:21.9727053Z libnvidia-container-tools x86_64 1.17.8-1 nvidia-container-toolkit 40 k 2025-12-04T09:03:21.9727781Z libnvidia-container1 x86_64 1.17.8-1 nvidia-container-toolkit 1.0 M 2025-12-04T09:03:21.9728492Z nvidia-container-toolkit x86_64 1.17.8-1 nvidia-container-toolkit 1.2 M 2025-12-04T09:03:21.9729312Z nvidia-container-toolkit-base x86_64 1.17.8-1 nvidia-container-toolkit 5.8 M 2025-12-04T09:03:21.9730053Z 2025-12-04T09:03:21.9730238Z Transaction Summary 2025-12-04T09:03:21.9730711Z ================================================================================ 2025-12-04T09:03:21.9731327Z Downgrade 4 Packages 2025-12-04T09:03:21.9731546Z 2025-12-04T09:03:21.9731700Z Total download size: 8.0 M 2025-12-04T09:03:21.9732284Z Downloading Packages: 2025-12-04T09:03:22.0003718Z (1/4): libnvidia-container-tools-1.17.8-1.x86_6 1.5 MB/s | 40 kB 00:00 2025-12-04T09:03:22.0149415Z (2/4): libnvidia-container1-1.17.8-1.x86_64.rpm 24 MB/s | 1.0 MB 00:00 2025-12-04T09:03:22.0244779Z (3/4): nvidia-container-toolkit-1.17.8-1.x86_64 25 MB/s | 1.2 MB 00:00 2025-12-04T09:03:22.0585023Z (4/4): nvidia-container-toolkit-base-1.17.8-1.x 100 MB/s | 5.8 MB 00:00 2025-12-04T09:03:22.0591711Z -------------------------------------------------------------------------------- 2025-12-04T09:03:22.0594775Z Total 93 MB/s | 8.0 MB 00:00 2025-12-04T09:03:22.0597276Z Running transaction check 2025-12-04T09:03:22.0712492Z Transaction check succeeded. 2025-12-04T09:03:22.0713069Z Running transaction test 2025-12-04T09:03:22.1138364Z Transaction test succeeded. 2025-12-04T09:03:22.1141120Z Running transaction 2025-12-04T09:03:22.6113109Z Preparing : 1/1 2025-12-04T09:03:22.7129508Z Downgrading : nvidia-container-toolkit-base-1.17.8-1.x86_64 1/8 2025-12-04T09:03:22.7261242Z Downgrading : libnvidia-container1-1.17.8-1.x86_64 2/8 2025-12-04T09:03:22.7715912Z Running scriptlet: libnvidia-container1-1.17.8-1.x86_64 2/8 2025-12-04T09:03:22.8652794Z Downgrading : libnvidia-container-tools-1.17.8-1.x86_64 3/8 2025-12-04T09:03:22.8889207Z Downgrading : nvidia-container-toolkit-1.17.8-1.x86_64 4/8 2025-12-04T09:03:22.9389694Z Running scriptlet: nvidia-container-toolkit-1.17.8-1.x86_64 4/8 2025-12-04T09:03:22.9443946Z Running scriptlet: nvidia-container-toolkit-1.18.1-1.x86_64 5/8 2025-12-04T09:03:22.9446455Z Cleanup : nvidia-container-toolkit-1.18.1-1.x86_64 5/8 2025-12-04T09:03:22.9701965Z Running scriptlet: nvidia-container-toolkit-1.18.1-1.x86_64 5/8 2025-12-04T09:03:22.9752106Z Running scriptlet: libnvidia-container-tools-1.18.1-1.x86_64 6/8 2025-12-04T09:03:22.9753335Z Cleanup : libnvidia-container-tools-1.18.1-1.x86_64 6/8 2025-12-04T09:03:23.0001823Z Running scriptlet: libnvidia-container-tools-1.18.1-1.x86_64 6/8 2025-12-04T09:03:23.0058034Z Running scriptlet: libnvidia-container1-1.18.1-1.x86_64 7/8 2025-12-04T09:03:23.0058937Z Cleanup : libnvidia-container1-1.18.1-1.x86_64 7/8 2025-12-04T09:03:23.0533799Z Running scriptlet: libnvidia-container1-1.18.1-1.x86_64 7/8 2025-12-04T09:03:23.0588428Z Running scriptlet: nvidia-container-toolkit-base-1.18.1-1.x86_64 8/8 2025-12-04T09:03:23.0589394Z Cleanup : nvidia-container-toolkit-base-1.18.1-1.x86_64 8/8 2025-12-04T09:03:23.0739408Z Running scriptlet: nvidia-container-toolkit-base-1.18.1-1.x86_64 8/8 2025-12-04T09:03:23.1174032Z Running scriptlet: nvidia-container-toolkit-1.17.8-1.x86_64 8/8 2025-12-04T09:04:10.0738325Z Running scriptlet: nvidia-container-toolkit-base-1.18.1-1.x86_64 8/8 2025-12-04T09:04:10.0740075Z Verifying : libnvidia-container-tools-1.17.8-1.x86_64 1/8 2025-12-04T09:04:10.0740672Z Verifying : libnvidia-container-tools-1.18.1-1.x86_64 2/8 2025-12-04T09:04:10.0741175Z Verifying : libnvidia-container1-1.17.8-1.x86_64 3/8 2025-12-04T09:04:10.0741707Z Verifying : libnvidia-container1-1.18.1-1.x86_64 4/8 2025-12-04T09:04:10.0742223Z Verifying : nvidia-container-toolkit-1.17.8-1.x86_64 5/8 2025-12-04T09:04:10.0742722Z Verifying : nvidia-container-toolkit-1.18.1-1.x86_64 6/8 2025-12-04T09:04:10.0743212Z Verifying : nvidia-container-toolkit-base-1.17.8-1.x86_64 7/8 2025-12-04T09:04:10.2314160Z Verifying : nvidia-container-toolkit-base-1.18.1-1.x86_64 8/8================================================================================ 2025-12-04T09:04:10.2314766Z WARNING: 2025-12-04T09:04:10.2315035Z A newer release of "Amazon Linux" is available. 2025-12-04T09:04:10.2315276Z 2025-12-04T09:04:10.2315366Z Available Versions: 2025-12-04T09:04:10.2315525Z 2025-12-04T09:04:10.2315625Z Version 2023.9.20250929: 2025-12-04T09:04:10.2315945Z Run the following command to upgrade to 2023.9.20250929: 2025-12-04T09:04:10.2316198Z 2025-12-04T09:04:10.2316324Z dnf upgrade --releasever=2023.9.20250929 2025-12-04T09:04:10.2316536Z 2025-12-04T09:04:10.2316618Z Release notes: 2025-12-04T09:04:10.2317041Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.9.20250929.html 2025-12-04T09:04:10.2317420Z 2025-12-04T09:04:10.2317511Z Version 2023.9.20251014: 2025-12-04T09:04:10.2317804Z Run the following command to upgrade to 2023.9.20251014: 2025-12-04T09:04:10.2318070Z 2025-12-04T09:04:10.2318183Z dnf upgrade --releasever=2023.9.20251014 2025-12-04T09:04:10.2318401Z 2025-12-04T09:04:10.2318480Z Release notes: 2025-12-04T09:04:10.2318871Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.9.20251014.html 2025-12-04T09:04:10.2319244Z 2025-12-04T09:04:10.2319330Z Version 2023.9.20251020: 2025-12-04T09:04:10.2319919Z Run the following command to upgrade to 2023.9.20251020: 2025-12-04T09:04:10.2320138Z 2025-12-04T09:04:10.2320237Z dnf upgrade --releasever=2023.9.20251020 2025-12-04T09:04:10.2320402Z 2025-12-04T09:04:10.2320477Z Release notes: 2025-12-04T09:04:10.2320790Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.9.20251020.html 2025-12-04T09:04:10.2321094Z 2025-12-04T09:04:10.2321162Z Version 2023.9.20251027: 2025-12-04T09:04:10.2321406Z Run the following command to upgrade to 2023.9.20251027: 2025-12-04T09:04:10.2321603Z 2025-12-04T09:04:10.2321687Z dnf upgrade --releasever=2023.9.20251027 2025-12-04T09:04:10.2321861Z 2025-12-04T09:04:10.2321927Z Release notes: 2025-12-04T09:04:10.2322235Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.9.20251027.html 2025-12-04T09:04:10.2322703Z 2025-12-04T09:04:10.2322775Z Version 2023.9.20251105: 2025-12-04T09:04:10.2323003Z Run the following command to upgrade to 2023.9.20251105: 2025-12-04T09:04:10.2323204Z 2025-12-04T09:04:10.2323294Z dnf upgrade --releasever=2023.9.20251105 2025-12-04T09:04:10.2323454Z 2025-12-04T09:04:10.2323524Z Release notes: 2025-12-04T09:04:10.2323819Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.9.20251105.html 2025-12-04T09:04:10.2324114Z 2025-12-04T09:04:10.2324178Z Version 2023.9.20251110: 2025-12-04T09:04:10.2324408Z Run the following command to upgrade to 2023.9.20251110: 2025-12-04T09:04:10.2324601Z 2025-12-04T09:04:10.2324687Z dnf upgrade --releasever=2023.9.20251110 2025-12-04T09:04:10.2324849Z 2025-12-04T09:04:10.2324911Z Release notes: 2025-12-04T09:04:10.2325210Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.9.20251110.html 2025-12-04T09:04:10.2325496Z 2025-12-04T09:04:10.2325571Z Version 2023.9.20251117: 2025-12-04T09:04:10.2325794Z Run the following command to upgrade to 2023.9.20251117: 2025-12-04T09:04:10.2325995Z 2025-12-04T09:04:10.2326081Z dnf upgrade --releasever=2023.9.20251117 2025-12-04T09:04:10.2326245Z 2025-12-04T09:04:10.2326310Z Release notes: 2025-12-04T09:04:10.2326607Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.9.20251117.html 2025-12-04T09:04:10.2326896Z 2025-12-04T09:04:10.2326981Z ================================================================================ 2025-12-04T09:04:10.2788121Z 2025-12-04T09:04:10.2788246Z 2025-12-04T09:04:10.2788325Z Downgraded: 2025-12-04T09:04:10.2788671Z libnvidia-container-tools-1.17.8-1.x86_64 2025-12-04T09:04:10.2789198Z libnvidia-container1-1.17.8-1.x86_64 2025-12-04T09:04:10.2789711Z nvidia-container-toolkit-1.17.8-1.x86_64 2025-12-04T09:04:10.2790260Z nvidia-container-toolkit-base-1.17.8-1.x86_64 2025-12-04T09:04:10.2790575Z 2025-12-04T09:04:10.2790657Z Complete! 2025-12-04T09:04:10.3502901Z + sudo systemctl restart docker 2025-12-04T09:04:23.5999341Z Thu Dec 4 09:04:23 2025 2025-12-04T09:04:23.6000112Z +-----------------------------------------------------------------------------------------+ 2025-12-04T09:04:23.6001030Z | NVIDIA-SMI 580.82.07 Driver Version: 580.82.07 CUDA Version: 13.0 | 2025-12-04T09:04:23.6001883Z +-----------------------------------------+------------------------+----------------------+ 2025-12-04T09:04:23.6002763Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2025-12-04T09:04:23.6003664Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2025-12-04T09:04:23.6003990Z | | | MIG M. | 2025-12-04T09:04:23.6004272Z |=========================================+========================+======================| 2025-12-04T09:04:23.6074247Z | 0 NVIDIA L4 On | 00000000:35:00.0 Off | 0 | 2025-12-04T09:04:23.6075031Z | N/A 39C P0 30W / 72W | 0MiB / 23034MiB | 4% Default | 2025-12-04T09:04:23.6075429Z | | | N/A | 2025-12-04T09:04:23.6075807Z +-----------------------------------------+------------------------+----------------------+ 2025-12-04T09:04:23.6076096Z 2025-12-04T09:04:23.6076257Z +-----------------------------------------------------------------------------------------+ 2025-12-04T09:04:23.6076655Z | Processes: | 2025-12-04T09:04:23.6077060Z | GPU GI CI PID Type Process name GPU Memory | 2025-12-04T09:04:23.6077425Z | ID ID Usage | 2025-12-04T09:04:23.6077955Z |=========================================================================================| 2025-12-04T09:04:23.6078861Z | No running processes found | 2025-12-04T09:04:23.6079318Z +-----------------------------------------------------------------------------------------+ 2025-12-04T09:04:23.7681084Z Unable to find image 'public.ecr.aws/docker/library/python:3.13' locally 2025-12-04T09:04:24.0131226Z 3.13: Pulling from docker/library/python 2025-12-04T09:04:24.1011327Z 53c88f1dfeb7: Pulling fs layer 2025-12-04T09:04:24.1012122Z eae668646f44: Pulling fs layer 2025-12-04T09:04:24.1012643Z ff2e6e687b6c: Pulling fs layer 2025-12-04T09:04:24.1013002Z 7c40a3faff76: Pulling fs layer 2025-12-04T09:04:24.1013361Z 967a3b1c8fef: Pulling fs layer 2025-12-04T09:04:24.1013746Z a64e1a44f22a: Pulling fs layer 2025-12-04T09:04:24.1026209Z 52655f8a5bcc: Pulling fs layer 2025-12-04T09:04:24.1026598Z 967a3b1c8fef: Waiting 2025-12-04T09:04:24.1026896Z a64e1a44f22a: Waiting 2025-12-04T09:04:24.1027159Z 52655f8a5bcc: Waiting 2025-12-04T09:04:24.1027507Z 7c40a3faff76: Waiting 2025-12-04T09:04:24.2057760Z eae668646f44: Verifying Checksum 2025-12-04T09:04:24.2058115Z eae668646f44: Download complete 2025-12-04T09:04:24.2353958Z 53c88f1dfeb7: Download complete 2025-12-04T09:04:24.3013681Z ff2e6e687b6c: Verifying Checksum 2025-12-04T09:04:24.3014072Z ff2e6e687b6c: Download complete 2025-12-04T09:04:24.3038135Z 967a3b1c8fef: Verifying Checksum 2025-12-04T09:04:24.3038421Z 967a3b1c8fef: Download complete 2025-12-04T09:04:24.3336455Z 52655f8a5bcc: Verifying Checksum 2025-12-04T09:04:24.3336775Z 52655f8a5bcc: Download complete 2025-12-04T09:04:24.4251276Z a64e1a44f22a: Verifying Checksum 2025-12-04T09:04:24.4251634Z a64e1a44f22a: Download complete 2025-12-04T09:04:25.1969089Z 7c40a3faff76: Verifying Checksum 2025-12-04T09:04:25.1969418Z 7c40a3faff76: Download complete 2025-12-04T09:04:25.5510484Z 53c88f1dfeb7: Pull complete 2025-12-04T09:04:26.0741626Z eae668646f44: Pull complete 2025-12-04T09:04:27.8256160Z ff2e6e687b6c: Pull complete 2025-12-04T09:04:32.6149523Z 7c40a3faff76: Pull complete 2025-12-04T09:04:32.8326397Z 967a3b1c8fef: Pull complete 2025-12-04T09:04:33.5202357Z a64e1a44f22a: Pull complete 2025-12-04T09:04:33.7236331Z 52655f8a5bcc: Pull complete 2025-12-04T09:04:33.7713098Z Digest: sha256:3f986299a7b8b44b0d8cf9bda2b22361ce5c3058ef5d7cb17fb7452506680ab0 2025-12-04T09:04:33.7887185Z Status: Downloaded newer image for public.ecr.aws/docker/library/python:3.13 2025-12-04T09:04:41.4760867Z Thu Dec 4 09:04:41 2025 2025-12-04T09:04:41.4761544Z +-----------------------------------------------------------------------------------------+ 2025-12-04T09:04:41.4762486Z | NVIDIA-SMI 580.82.07 Driver Version: 580.82.07 CUDA Version: 13.0 | 2025-12-04T09:04:41.4763083Z +-----------------------------------------+------------------------+----------------------+ 2025-12-04T09:04:41.4763713Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2025-12-04T09:04:41.4764637Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2025-12-04T09:04:41.4765178Z | | | MIG M. | 2025-12-04T09:04:41.4765557Z |=========================================+========================+======================| 2025-12-04T09:04:41.4868474Z | 0 NVIDIA L4 On | 00000000:35:00.0 Off | 0 | 2025-12-04T09:04:41.4869257Z | N/A 38C P8 13W / 72W | 0MiB / 23034MiB | 0% Default | 2025-12-04T09:04:41.4869829Z | | | N/A | 2025-12-04T09:04:41.4870473Z +-----------------------------------------+------------------------+----------------------+ 2025-12-04T09:04:41.4871589Z 2025-12-04T09:04:41.4872198Z +-----------------------------------------------------------------------------------------+ 2025-12-04T09:04:41.4872986Z | Processes: | 2025-12-04T09:04:41.4873621Z | GPU GI CI PID Type Process name GPU Memory | 2025-12-04T09:04:41.4874127Z | ID ID Usage | 2025-12-04T09:04:41.4874535Z |=========================================================================================| 2025-12-04T09:04:41.4877001Z | No running processes found | 2025-12-04T09:04:41.4877855Z +-----------------------------------------------------------------------------------------+ 2025-12-04T09:04:42.6688076Z Command completed after 1 attempt(s). 2025-12-04T09:04:42.6774225Z Prepare all required actions 2025-12-04T09:04:42.6798223Z ##[group]Run ./.github/actions/get-workflow-job-id 2025-12-04T09:04:42.6798482Z with: 2025-12-04T09:04:42.6799024Z github-token: *** 2025-12-04T09:04:42.6799214Z env: 2025-12-04T09:04:42.6799368Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:04:42.6799559Z HAS_NVIDIA_GPU: true 2025-12-04T09:04:42.6799783Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:04:42.6800039Z ##[endgroup] 2025-12-04T09:04:42.6813293Z ##[group]Run set -eux 2025-12-04T09:04:42.6813483Z set -eux 2025-12-04T09:04:42.6813795Z python3 .github/scripts/get_workflow_job_id.py "${GITHUB_RUN_ID}" "${RUNNER_NAME}" 2025-12-04T09:04:42.6825883Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:04:42.6826157Z env: 2025-12-04T09:04:42.6826326Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:04:42.6826517Z HAS_NVIDIA_GPU: true 2025-12-04T09:04:42.6826791Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:04:42.6827180Z GITHUB_TOKEN: *** 2025-12-04T09:04:42.6827352Z ##[endgroup] 2025-12-04T09:04:42.6861649Z + python3 .github/scripts/get_workflow_job_id.py 19922768520 i-004b897e3464f6ebc 2025-12-04T09:04:43.9887226Z Setting output job-id=57116084853 2025-12-04T09:04:43.9887938Z Setting output job-name=linux-jammy-cuda12.8-py3.10-gcc11 / test (default, 1, 5, lf.linux.g6.4xlarge.experimental.nvidia.gpu, rerun_disabled_tests) 2025-12-04T09:04:43.9998659Z ##[group]Run python3 -m pip install psutil==5.9.8 dataclasses_json==0.6.7 nvidia-ml-py==11.525.84 2025-12-04T09:04:43.9999260Z python3 -m pip install psutil==5.9.8 dataclasses_json==0.6.7 nvidia-ml-py==11.525.84 2025-12-04T09:04:43.9999940Z python3 -m tools.stats.monitor --log-interval "$MONITOR_LOG_INTERVAL" --data-collect-interval "$MONITOR_DATA_COLLECT_INTERVAL" > usage_log.txt 2>&1 & 2025-12-04T09:04:44.0000543Z echo "monitor-script-pid=${!}" >> "${GITHUB_OUTPUT}" 2025-12-04T09:04:44.0009231Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:04:44.0009526Z env: 2025-12-04T09:04:44.0009693Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:04:44.0009901Z HAS_NVIDIA_GPU: true 2025-12-04T09:04:44.0010145Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:04:44.0010406Z JOB_ID: 57116084853 2025-12-04T09:04:44.0010884Z JOB_NAME: linux-jammy-cuda12.8-py3.10-gcc11 / test (default, 1, 5, lf.linux.g6.4xlarge.experimental.nvidia.gpu, rerun_disabled_tests) 2025-12-04T09:04:44.0011364Z WORKFLOW_NAME: trunk 2025-12-04T09:04:44.0011542Z WORKFLOW_RUN_ID: 19922768520 2025-12-04T09:04:44.0011745Z MONITOR_LOG_INTERVAL: 5 2025-12-04T09:04:44.0011942Z MONITOR_DATA_COLLECT_INTERVAL: 1 2025-12-04T09:04:44.0012141Z ##[endgroup] 2025-12-04T09:04:44.2819482Z Defaulting to user installation because normal site-packages is not writeable 2025-12-04T09:04:44.6082634Z Collecting psutil==5.9.8 2025-12-04T09:04:44.6242525Z Downloading psutil-5.9.8-cp36-abi3-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (288 kB) 2025-12-04T09:04:44.6905870Z Collecting dataclasses_json==0.6.7 2025-12-04T09:04:44.6941485Z Downloading dataclasses_json-0.6.7-py3-none-any.whl (28 kB) 2025-12-04T09:04:44.7195894Z Collecting nvidia-ml-py==11.525.84 2025-12-04T09:04:44.7235698Z Downloading nvidia_ml_py-11.525.84-py3-none-any.whl (34 kB) 2025-12-04T09:04:44.7533743Z Collecting typing-inspect<1,>=0.4.0 2025-12-04T09:04:44.7571236Z Downloading typing_inspect-0.9.0-py3-none-any.whl (8.8 kB) 2025-12-04T09:04:44.8523083Z Collecting marshmallow<4.0.0,>=3.18.0 2025-12-04T09:04:44.8560516Z Downloading marshmallow-3.26.1-py3-none-any.whl (50 kB) 2025-12-04T09:04:44.9070111Z Collecting packaging>=17.0 2025-12-04T09:04:44.9108494Z Downloading packaging-25.0-py3-none-any.whl (66 kB) 2025-12-04T09:04:44.9322017Z Collecting mypy-extensions>=0.3.0 2025-12-04T09:04:44.9360184Z Downloading mypy_extensions-1.1.0-py3-none-any.whl (5.0 kB) 2025-12-04T09:04:44.9782518Z Collecting typing-extensions>=3.7.4 2025-12-04T09:04:44.9818802Z Downloading typing_extensions-4.15.0-py3-none-any.whl (44 kB) 2025-12-04T09:04:45.0737555Z Installing collected packages: typing-extensions, packaging, mypy-extensions, typing-inspect, marshmallow, psutil, nvidia-ml-py, dataclasses-json 2025-12-04T09:04:45.3206410Z Successfully installed dataclasses-json-0.6.7 marshmallow-3.26.1 mypy-extensions-1.1.0 nvidia-ml-py-11.525.84 packaging-25.0 psutil-5.9.8 typing-extensions-4.15.0 typing-inspect-0.9.0 2025-12-04T09:04:45.4819138Z Prepare all required actions 2025-12-04T09:04:45.4819493Z Getting action download info 2025-12-04T09:04:45.6371846Z Download action repository 'seemethere/download-artifact-s3@v4' (SHA:1da556a7aa0a088e3153970611f6c432d58e80e6) 2025-12-04T09:04:45.9528978Z Download action repository 'actions/download-artifact@v4' (SHA:d3f86a106a0bac45b974a628896c90dbdf5c8093) 2025-12-04T09:04:46.3529188Z ##[group]Run ./.github/actions/download-build-artifacts 2025-12-04T09:04:46.3529465Z with: 2025-12-04T09:04:46.3529652Z name: linux-jammy-cuda12.8-py3.10-gcc11 2025-12-04T09:04:46.3529902Z s3-bucket: gha-artifacts 2025-12-04T09:04:46.3530104Z env: 2025-12-04T09:04:46.3530253Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:04:46.3530444Z HAS_NVIDIA_GPU: true 2025-12-04T09:04:46.3530675Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:04:46.3530921Z ##[endgroup] 2025-12-04T09:04:46.3555616Z ##[group]Run seemethere/download-artifact-s3@v4 2025-12-04T09:04:46.3555864Z with: 2025-12-04T09:04:46.3556094Z name: linux-jammy-cuda12.8-py3.10-gcc11 2025-12-04T09:04:46.3556334Z s3-bucket: gha-artifacts 2025-12-04T09:04:46.3556527Z region: us-east-1 2025-12-04T09:04:46.3556690Z env: 2025-12-04T09:04:46.3556841Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:04:46.3557019Z HAS_NVIDIA_GPU: true 2025-12-04T09:04:46.3557252Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:04:46.3557514Z ##[endgroup] 2025-12-04T09:04:46.7828399Z (node:58895) NOTE: We are formalizing our plans to enter AWS SDK for JavaScript (v2) into maintenance mode in 2023. 2025-12-04T09:04:46.7828906Z 2025-12-04T09:04:46.7829086Z Please migrate your code to use AWS SDK for JavaScript (v3). 2025-12-04T09:04:46.7829641Z For more information, check the migration guide at https://a.co/7PzMCcy 2025-12-04T09:04:46.7830166Z (Use `node --trace-warnings ...` to show where the warning was created) 2025-12-04T09:04:47.0519916Z Found 1 objects with prefix pytorch/pytorch/19922768520/linux-jammy-cuda12.8-py3.10-gcc11/ 2025-12-04T09:04:47.0520619Z Starting download (1/1): /home/ec2-user/actions-runner/_work/pytorch/pytorch/artifacts.zip 2025-12-04T09:04:55.3064598Z Finished download (1/1): /home/ec2-user/actions-runner/_work/pytorch/pytorch/artifacts.zip 2025-12-04T09:04:55.3069743Z Artifact download has finished successfully 2025-12-04T09:04:55.3346978Z ##[group]Run unzip -o artifacts.zip 2025-12-04T09:04:55.3347244Z unzip -o artifacts.zip 2025-12-04T09:04:55.3355632Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:04:55.3355923Z env: 2025-12-04T09:04:55.3356081Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:04:55.3356276Z HAS_NVIDIA_GPU: true 2025-12-04T09:04:55.3356740Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:04:55.3357003Z ##[endgroup] 2025-12-04T09:04:55.3434508Z Archive: artifacts.zip 2025-12-04T09:04:55.3435664Z creating: dist/ 2025-12-04T09:04:55.3552135Z inflating: dist/.ninja_log 2025-12-04T09:04:57.3365096Z inflating: dist/torch-2.10.0a0+gitffd9b0f-cp310-cp310-linux_x86_64.whl 2025-12-04T09:04:57.3365585Z creating: build/ 2025-12-04T09:04:57.3365841Z creating: build/custom_test_artifacts/ 2025-12-04T09:04:57.3366219Z creating: build/custom_test_artifacts/custom-op-build/ 2025-12-04T09:04:57.3366684Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/ 2025-12-04T09:04:57.3367246Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/pkgRedirects/ 2025-12-04T09:04:57.3374250Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/CMakeConfigureLog.yaml 2025-12-04T09:04:57.3374896Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/ 2025-12-04T09:04:57.3375550Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CMakeSystem.cmake 2025-12-04T09:04:57.3376231Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdC/ 2025-12-04T09:04:57.3377204Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdC/tmp/ 2025-12-04T09:04:57.3378488Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdC/CMakeCCompilerId.c 2025-12-04T09:04:57.3380067Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdC/a.out 2025-12-04T09:04:57.3381191Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CMakeCCompiler.cmake 2025-12-04T09:04:57.3381963Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCXX/ 2025-12-04T09:04:57.3382630Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCXX/tmp/ 2025-12-04T09:04:57.3384173Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCXX/CMakeCXXCompilerId.cpp 2025-12-04T09:04:57.3385900Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCXX/a.out 2025-12-04T09:04:57.3386875Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CMakeCXXCompiler.cmake 2025-12-04T09:04:57.3388648Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CMakeDetermineCompilerABI_C.bin 2025-12-04T09:04:57.3390611Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CMakeDetermineCompilerABI_CXX.bin 2025-12-04T09:04:57.3391245Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/ 2025-12-04T09:04:57.3391797Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/ 2025-12-04T09:04:57.3439327Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cpp4.ii 2025-12-04T09:04:57.3487905Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cudafe1.cpp 2025-12-04T09:04:57.3488888Z extracting: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.module_id 2025-12-04T09:04:57.3539753Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cpp1.ii 2025-12-04T09:04:57.3540751Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cudafe1.c 2025-12-04T09:04:57.3541760Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cudafe1.gpu 2025-12-04T09:04:57.3542838Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cudafe1.stub.c 2025-12-04T09:04:57.3543896Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.ptx 2025-12-04T09:04:57.3545166Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.sm_52.cubin 2025-12-04T09:04:57.3546208Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.fatbin 2025-12-04T09:04:57.3547232Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.fatbin.c 2025-12-04T09:04:57.3548218Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.o 2025-12-04T09:04:57.3549166Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/a_dlink.sm_52.cubin 2025-12-04T09:04:57.3550089Z extracting: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/a_dlink.reg.c 2025-12-04T09:04:57.3551002Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/a_dlink.fatbin 2025-12-04T09:04:57.3551896Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/a_dlink.fatbin.c 2025-12-04T09:04:57.3552824Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/a_dlink.o 2025-12-04T09:04:57.3553937Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/CMakeCUDACompilerId.cu 2025-12-04T09:04:57.3616510Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/a.out 2025-12-04T09:04:57.3617479Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CMakeCUDACompiler.cmake 2025-12-04T09:04:57.3680042Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CMakeDetermineCompilerABI_CUDA.bin 2025-12-04T09:04:57.3680825Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/CMakeScratch/ 2025-12-04T09:04:57.3681424Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/CMakeTmp/ 2025-12-04T09:04:57.3682043Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/cmake.check_cache 2025-12-04T09:04:57.3682674Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/ 2025-12-04T09:04:57.3683366Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/compiler_depend.ts 2025-12-04T09:04:57.3684152Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/compiler_depend.make 2025-12-04T09:04:57.3684915Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/depend.make 2025-12-04T09:04:57.3685622Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/link.txt 2025-12-04T09:04:57.3686338Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/cmake_clean.cmake 2025-12-04T09:04:57.3687114Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/build.make 2025-12-04T09:04:57.3687913Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/DependInfo.cmake 2025-12-04T09:04:57.3688712Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/flags.make 2025-12-04T09:04:57.3689531Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/progress.make 2025-12-04T09:04:57.3707664Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/op.cpp.o.d 2025-12-04T09:04:57.3869216Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/op.cpp.o 2025-12-04T09:04:57.3869853Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/ 2025-12-04T09:04:57.3870543Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/compiler_depend.ts 2025-12-04T09:04:57.3871532Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/compiler_depend.make 2025-12-04T09:04:57.3872238Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/depend.make 2025-12-04T09:04:57.3872785Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/link.txt 2025-12-04T09:04:57.3873352Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/cmake_clean.cmake 2025-12-04T09:04:57.3875007Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/build.make 2025-12-04T09:04:57.3875779Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/DependInfo.cmake 2025-12-04T09:04:57.3876535Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/flags.make 2025-12-04T09:04:57.3877337Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/progress.make 2025-12-04T09:04:57.3895525Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/test_custom_ops.cpp.o.d 2025-12-04T09:04:57.3961098Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/test_custom_ops.cpp.o 2025-12-04T09:04:57.3962367Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/CMakeDirectoryInformation.cmake 2025-12-04T09:04:57.3963113Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/TargetDirectories.txt 2025-12-04T09:04:57.3963751Z extracting: build/custom_test_artifacts/custom-op-build/CMakeFiles/progress.marks 2025-12-04T09:04:57.3964507Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/Makefile2 2025-12-04T09:04:57.3966137Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/Makefile.cmake 2025-12-04T09:04:57.3966695Z inflating: build/custom_test_artifacts/custom-op-build/detect_cuda_version.cc 2025-12-04T09:04:57.3969245Z inflating: build/custom_test_artifacts/custom-op-build/CMakeCache.txt 2025-12-04T09:04:57.3970035Z inflating: build/custom_test_artifacts/custom-op-build/Makefile 2025-12-04T09:04:57.3970852Z inflating: build/custom_test_artifacts/custom-op-build/cmake_install.cmake 2025-12-04T09:04:57.4107281Z inflating: build/custom_test_artifacts/custom-op-build/libcustom_ops.so 2025-12-04T09:04:57.4151210Z inflating: build/custom_test_artifacts/custom-op-build/test_custom_ops 2025-12-04T09:04:57.4151785Z creating: build/custom_test_artifacts/jit-hook-build/ 2025-12-04T09:04:57.4152297Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/ 2025-12-04T09:04:57.4152913Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/pkgRedirects/ 2025-12-04T09:04:57.4159110Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/CMakeConfigureLog.yaml 2025-12-04T09:04:57.4159842Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/ 2025-12-04T09:04:57.4160569Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CMakeSystem.cmake 2025-12-04T09:04:57.4161329Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdC/ 2025-12-04T09:04:57.4162285Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdC/tmp/ 2025-12-04T09:04:57.4163436Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdC/CMakeCCompilerId.c 2025-12-04T09:04:57.4165192Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdC/a.out 2025-12-04T09:04:57.4165989Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CMakeCCompiler.cmake 2025-12-04T09:04:57.4166658Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCXX/ 2025-12-04T09:04:57.4167312Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCXX/tmp/ 2025-12-04T09:04:57.4169352Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCXX/CMakeCXXCompilerId.cpp 2025-12-04T09:04:57.4170785Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCXX/a.out 2025-12-04T09:04:57.4171875Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CMakeCXXCompiler.cmake 2025-12-04T09:04:57.4173747Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CMakeDetermineCompilerABI_C.bin 2025-12-04T09:04:57.4175525Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CMakeDetermineCompilerABI_CXX.bin 2025-12-04T09:04:57.4176106Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/ 2025-12-04T09:04:57.4176611Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/ 2025-12-04T09:04:57.4224268Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cpp4.ii 2025-12-04T09:04:57.4272714Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cudafe1.cpp 2025-12-04T09:04:57.4273651Z extracting: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.module_id 2025-12-04T09:04:57.4324656Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cpp1.ii 2025-12-04T09:04:57.4325691Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cudafe1.c 2025-12-04T09:04:57.4326699Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cudafe1.gpu 2025-12-04T09:04:57.4327714Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cudafe1.stub.c 2025-12-04T09:04:57.4328699Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.ptx 2025-12-04T09:04:57.4329689Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.sm_52.cubin 2025-12-04T09:04:57.4330645Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.fatbin 2025-12-04T09:04:57.4331604Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.fatbin.c 2025-12-04T09:04:57.4332602Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.o 2025-12-04T09:04:57.4333363Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/a_dlink.sm_52.cubin 2025-12-04T09:04:57.4334088Z extracting: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/a_dlink.reg.c 2025-12-04T09:04:57.4334791Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/a_dlink.fatbin 2025-12-04T09:04:57.4335511Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/a_dlink.fatbin.c 2025-12-04T09:04:57.4336217Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/a_dlink.o 2025-12-04T09:04:57.4338657Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/CMakeCUDACompilerId.cu 2025-12-04T09:04:57.4401003Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/a.out 2025-12-04T09:04:57.4401753Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CMakeCUDACompiler.cmake 2025-12-04T09:04:57.4464417Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CMakeDetermineCompilerABI_CUDA.bin 2025-12-04T09:04:57.4465118Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/CMakeScratch/ 2025-12-04T09:04:57.4465643Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/CMakeTmp/ 2025-12-04T09:04:57.4466399Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/cmake.check_cache 2025-12-04T09:04:57.4466971Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/ 2025-12-04T09:04:57.4467634Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/compiler_depend.ts 2025-12-04T09:04:57.4468386Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/compiler_depend.make 2025-12-04T09:04:57.4469100Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/depend.make 2025-12-04T09:04:57.4469757Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/link.txt 2025-12-04T09:04:57.4470441Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/cmake_clean.cmake 2025-12-04T09:04:57.4471141Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/build.make 2025-12-04T09:04:57.4471838Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/DependInfo.cmake 2025-12-04T09:04:57.4472596Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/flags.make 2025-12-04T09:04:57.4473381Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/progress.make 2025-12-04T09:04:57.4491445Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/test_jit_hooks.cpp.o.d 2025-12-04T09:04:57.4542558Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/test_jit_hooks.cpp.o 2025-12-04T09:04:57.4543450Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/CMakeDirectoryInformation.cmake 2025-12-04T09:04:57.4544234Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/TargetDirectories.txt 2025-12-04T09:04:57.4544948Z extracting: build/custom_test_artifacts/jit-hook-build/CMakeFiles/progress.marks 2025-12-04T09:04:57.4545771Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/Makefile2 2025-12-04T09:04:57.4547278Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/Makefile.cmake 2025-12-04T09:04:57.4547947Z inflating: build/custom_test_artifacts/jit-hook-build/detect_cuda_version.cc 2025-12-04T09:04:57.4550415Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeCache.txt 2025-12-04T09:04:57.4551198Z inflating: build/custom_test_artifacts/jit-hook-build/Makefile 2025-12-04T09:04:57.4552065Z inflating: build/custom_test_artifacts/jit-hook-build/cmake_install.cmake 2025-12-04T09:04:57.4583041Z inflating: build/custom_test_artifacts/jit-hook-build/test_jit_hooks 2025-12-04T09:04:57.4583581Z creating: build/custom_test_artifacts/custom-backend-build/ 2025-12-04T09:04:57.4584084Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/ 2025-12-04T09:04:57.4584682Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/pkgRedirects/ 2025-12-04T09:04:57.4590826Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/CMakeConfigureLog.yaml 2025-12-04T09:04:57.4591527Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/ 2025-12-04T09:04:57.4592285Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CMakeSystem.cmake 2025-12-04T09:04:57.4592958Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdC/ 2025-12-04T09:04:57.4593605Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdC/tmp/ 2025-12-04T09:04:57.4594789Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdC/CMakeCCompilerId.c 2025-12-04T09:04:57.4596511Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdC/a.out 2025-12-04T09:04:57.4597411Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CMakeCCompiler.cmake 2025-12-04T09:04:57.4598446Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCXX/ 2025-12-04T09:04:57.4599211Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCXX/tmp/ 2025-12-04T09:04:57.4600609Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCXX/CMakeCXXCompilerId.cpp 2025-12-04T09:04:57.4602142Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCXX/a.out 2025-12-04T09:04:57.4603177Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CMakeCXXCompiler.cmake 2025-12-04T09:04:57.4604979Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CMakeDetermineCompilerABI_C.bin 2025-12-04T09:04:57.4606791Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CMakeDetermineCompilerABI_CXX.bin 2025-12-04T09:04:57.4607565Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/ 2025-12-04T09:04:57.4608249Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/ 2025-12-04T09:04:57.4655718Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cpp4.ii 2025-12-04T09:04:57.4704007Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cudafe1.cpp 2025-12-04T09:04:57.4704999Z extracting: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.module_id 2025-12-04T09:04:57.4755686Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cpp1.ii 2025-12-04T09:04:57.4756730Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cudafe1.c 2025-12-04T09:04:57.4757795Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cudafe1.gpu 2025-12-04T09:04:57.4758878Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cudafe1.stub.c 2025-12-04T09:04:57.4759922Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.ptx 2025-12-04T09:04:57.4760937Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.sm_52.cubin 2025-12-04T09:04:57.4762189Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.fatbin 2025-12-04T09:04:57.4763171Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.fatbin.c 2025-12-04T09:04:57.4764091Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.o 2025-12-04T09:04:57.4764957Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/a_dlink.sm_52.cubin 2025-12-04T09:04:57.4765800Z extracting: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/a_dlink.reg.c 2025-12-04T09:04:57.4766632Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/a_dlink.fatbin 2025-12-04T09:04:57.4767472Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/a_dlink.fatbin.c 2025-12-04T09:04:57.4768290Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/a_dlink.o 2025-12-04T09:04:57.4769806Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/CMakeCUDACompilerId.cu 2025-12-04T09:04:57.4832216Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/a.out 2025-12-04T09:04:57.4833015Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CMakeCUDACompiler.cmake 2025-12-04T09:04:57.4895623Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CMakeDetermineCompilerABI_CUDA.bin 2025-12-04T09:04:57.4896371Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/CMakeScratch/ 2025-12-04T09:04:57.4896967Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/CMakeTmp/ 2025-12-04T09:04:57.4897571Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/cmake.check_cache 2025-12-04T09:04:57.4898202Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/ 2025-12-04T09:04:57.4898921Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/compiler_depend.ts 2025-12-04T09:04:57.4899742Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/compiler_depend.make 2025-12-04T09:04:57.4900526Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/depend.make 2025-12-04T09:04:57.4901426Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/link.txt 2025-12-04T09:04:57.4902246Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/cmake_clean.cmake 2025-12-04T09:04:57.4902945Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/build.make 2025-12-04T09:04:57.4903603Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/DependInfo.cmake 2025-12-04T09:04:57.4904271Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/flags.make 2025-12-04T09:04:57.4904935Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/progress.make 2025-12-04T09:04:57.4909037Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/custom_backend.cpp.o.d 2025-12-04T09:04:57.5005841Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/custom_backend.cpp.o 2025-12-04T09:04:57.5006638Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/ 2025-12-04T09:04:57.5007408Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/compiler_depend.ts 2025-12-04T09:04:57.5008255Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/compiler_depend.make 2025-12-04T09:04:57.5009066Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/depend.make 2025-12-04T09:04:57.5009821Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/link.txt 2025-12-04T09:04:57.5010605Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/cmake_clean.cmake 2025-12-04T09:04:57.5011388Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/build.make 2025-12-04T09:04:57.5012345Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/DependInfo.cmake 2025-12-04T09:04:57.5013145Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/flags.make 2025-12-04T09:04:57.5013954Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/progress.make 2025-12-04T09:04:57.5031352Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/test_custom_backend.cpp.o.d 2025-12-04T09:04:57.5076198Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/test_custom_backend.cpp.o 2025-12-04T09:04:57.5077282Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/CMakeDirectoryInformation.cmake 2025-12-04T09:04:57.5078032Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/TargetDirectories.txt 2025-12-04T09:04:57.5078712Z extracting: build/custom_test_artifacts/custom-backend-build/CMakeFiles/progress.marks 2025-12-04T09:04:57.5079342Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/Makefile2 2025-12-04T09:04:57.5081135Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/Makefile.cmake 2025-12-04T09:04:57.5081756Z inflating: build/custom_test_artifacts/custom-backend-build/detect_cuda_version.cc 2025-12-04T09:04:57.5083935Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeCache.txt 2025-12-04T09:04:57.5084901Z inflating: build/custom_test_artifacts/custom-backend-build/Makefile 2025-12-04T09:04:57.5085797Z inflating: build/custom_test_artifacts/custom-backend-build/cmake_install.cmake 2025-12-04T09:04:57.5166633Z inflating: build/custom_test_artifacts/custom-backend-build/libcustom_backend.so 2025-12-04T09:04:57.5197748Z inflating: build/custom_test_artifacts/custom-backend-build/test_custom_backend 2025-12-04T09:04:57.5198286Z creating: build/lib/ 2025-12-04T09:04:57.5265657Z inflating: build/lib/libprotobuf-lite.a 2025-12-04T09:04:57.5622535Z inflating: build/lib/libprotobuf.a 2025-12-04T09:04:57.6022493Z inflating: build/lib/libprotoc.a 2025-12-04T09:04:57.6030489Z inflating: build/lib/libpthreadpool.a 2025-12-04T09:04:57.6037204Z inflating: build/lib/libcpuinfo.a 2025-12-04T09:04:57.6043897Z inflating: build/lib/libcpuinfo_internals.a 2025-12-04T09:04:57.6044700Z inflating: build/lib/libclog.a 2025-12-04T09:04:57.6060286Z inflating: build/lib/libpytorch_qnnpack.a 2025-12-04T09:04:57.6062491Z inflating: build/lib/libnnpack_reference_layers.a 2025-12-04T09:04:57.6077167Z inflating: build/lib/libnnpack.a 2025-12-04T09:04:57.6224654Z inflating: build/lib/libmicrokernels-prod.a 2025-12-04T09:04:57.6906466Z inflating: build/lib/libmicrokernels-all.a 2025-12-04T09:04:57.6962714Z inflating: build/lib/libgtest.a 2025-12-04T09:04:57.6976751Z inflating: build/lib/libgmock.a 2025-12-04T09:04:57.6977504Z inflating: build/lib/libgtest_main.a 2025-12-04T09:04:57.6978410Z inflating: build/lib/libgmock_main.a 2025-12-04T09:04:57.7050631Z inflating: build/lib/libXNNPACK.a 2025-12-04T09:04:57.7112377Z inflating: build/lib/libbenchmark.a 2025-12-04T09:04:57.7113063Z inflating: build/lib/libbenchmark_main.a 2025-12-04T09:04:57.7114048Z inflating: build/lib/libjitprofiling.a 2025-12-04T09:04:57.7120906Z inflating: build/lib/libittnotify.a 2025-12-04T09:04:57.7175585Z inflating: build/lib/libasmjit.a 2025-12-04T09:04:57.8114384Z inflating: build/lib/libfbgemm.a 2025-12-04T09:04:57.8139002Z inflating: build/lib/libtensorpipe_uv.a 2025-12-04T09:04:57.8578613Z inflating: build/lib/libtensorpipe.a 2025-12-04T09:04:57.8774947Z inflating: build/lib/libtensorpipe_cuda.a 2025-12-04T09:04:57.8883039Z inflating: build/lib/libgloo.a 2025-12-04T09:04:57.8921443Z inflating: build/lib/libonnx_proto.a 2025-12-04T09:04:57.9278444Z inflating: build/lib/libgloo_cuda.a 2025-12-04T09:04:57.9857758Z inflating: build/lib/libonnx.a 2025-12-04T09:04:57.9873703Z inflating: build/lib/libfmt.a 2025-12-04T09:04:58.8094720Z inflating: build/lib/libdnnl.a 2025-12-04T09:04:58.8474796Z inflating: build/lib/libkineto.a 2025-12-04T09:04:58.8567491Z inflating: build/lib/libc10.so 2025-12-04T09:04:58.8607334Z inflating: build/lib/libc10_cuda.so 2025-12-04T09:04:58.8609240Z inflating: build/lib/libcaffe2_nvrtc.so 2025-12-04T09:04:58.8610790Z inflating: build/lib/libtorch_global_deps.so 2025-12-04T09:05:01.3392371Z inflating: build/lib/libtorch_cpu.so 2025-12-04T09:05:01.4022966Z inflating: build/lib/libtorch_nvshmem.so 2025-12-04T09:05:03.7922347Z inflating: build/lib/libtorch_cuda.so 2025-12-04T09:05:03.7923558Z inflating: build/lib/libtorch.so 2025-12-04T09:05:03.7965306Z inflating: build/lib/libtorch_cuda_linalg.so 2025-12-04T09:05:03.8022899Z inflating: build/lib/libtorchbind_test.so 2025-12-04T09:05:03.8038706Z inflating: build/lib/libjitbackend_test.so 2025-12-04T09:05:03.8058461Z inflating: build/lib/libbackend_with_compiler.so 2025-12-04T09:05:03.8080181Z inflating: build/lib/libaoti_custom_ops.so 2025-12-04T09:05:03.8082658Z inflating: build/lib/libc10d_cuda_test.so 2025-12-04T09:05:03.8086598Z inflating: build/lib/libshm.so 2025-12-04T09:05:03.9977248Z inflating: build/lib/libtorch_python.so 2025-12-04T09:05:04.0006047Z inflating: build/lib/libnnapi_backend.so 2025-12-04T09:05:04.0006430Z creating: build/bin/ 2025-12-04T09:05:04.0369962Z inflating: build/bin/protoc-3.13.0.0 2025-12-04T09:05:04.0733764Z inflating: build/bin/protoc 2025-12-04T09:05:04.0781308Z inflating: build/bin/c10_AllocatorConfig_test 2025-12-04T09:05:04.0825645Z inflating: build/bin/c10_CompileTimeFunctionPointer_test 2025-12-04T09:05:04.0871400Z inflating: build/bin/c10_DeviceGuard_test 2025-12-04T09:05:04.0917341Z inflating: build/bin/c10_Device_test 2025-12-04T09:05:04.0970564Z inflating: build/bin/c10_DispatchKeySet_test 2025-12-04T09:05:04.1017937Z inflating: build/bin/c10_Scalar_test 2025-12-04T09:05:04.1061764Z inflating: build/bin/c10_StreamGuard_test 2025-12-04T09:05:04.1111921Z inflating: build/bin/c10_SymInt_test 2025-12-04T09:05:04.1159978Z inflating: build/bin/c10_InlineDeviceGuard_test 2025-12-04T09:05:04.1209498Z inflating: build/bin/c10_InlineStreamGuard_test 2025-12-04T09:05:04.1253433Z inflating: build/bin/c10_ConstexprCrc_test 2025-12-04T09:05:04.1302973Z inflating: build/bin/c10_SizesAndStrides_test 2025-12-04T09:05:04.1363874Z inflating: build/bin/c10_cow_test 2025-12-04T09:05:04.1410878Z inflating: build/bin/c10_Bitset_test 2025-12-04T09:05:04.1455009Z inflating: build/bin/c10_ArrayRef_test 2025-12-04T09:05:04.1499315Z inflating: build/bin/c10_DeadlockDetection_test 2025-12-04T09:05:04.1545891Z inflating: build/bin/c10_IntrusiveList_test 2025-12-04T09:05:04.1595558Z inflating: build/bin/c10_LeftRight_test 2025-12-04T09:05:04.1640475Z inflating: build/bin/c10_Half_test 2025-12-04T09:05:04.1684644Z inflating: build/bin/c10_Semaphore_test 2025-12-04T09:05:04.1735287Z inflating: build/bin/c10_Enumerate_test 2025-12-04T09:05:04.1782606Z inflating: build/bin/c10_NetworkFlow_test 2025-12-04T09:05:04.1827245Z inflating: build/bin/c10_Synchronized_test 2025-12-04T09:05:04.1876521Z inflating: build/bin/c10_ThreadLocal_test 2025-12-04T09:05:04.1922529Z inflating: build/bin/c10_accumulate_test 2025-12-04T09:05:04.1968764Z inflating: build/bin/c10_TypeIndex_test 2025-12-04T09:05:04.2013860Z inflating: build/bin/c10_bit_cast_test 2025-12-04T09:05:04.2063390Z inflating: build/bin/c10_bfloat16_test 2025-12-04T09:05:04.2113646Z inflating: build/bin/c10_complex_math_test 2025-12-04T09:05:04.2160291Z inflating: build/bin/c10_exception_test 2025-12-04T09:05:04.2204460Z inflating: build/bin/c10_error_test 2025-12-04T09:05:04.2253251Z inflating: build/bin/c10_complex_test 2025-12-04T09:05:04.2298099Z inflating: build/bin/c10_flags_test 2025-12-04T09:05:04.2342941Z inflating: build/bin/c10_generic_math_test 2025-12-04T09:05:04.2475480Z inflating: build/bin/c10_intrusive_ptr_test 2025-12-04T09:05:04.2520745Z inflating: build/bin/c10_irange_test 2025-12-04T09:05:04.2568106Z inflating: build/bin/c10_lazy_test 2025-12-04T09:05:04.2613140Z inflating: build/bin/c10_nofatal_test 2025-12-04T09:05:04.2663282Z inflating: build/bin/c10_logging_test 2025-12-04T09:05:04.2728188Z inflating: build/bin/c10_optional_test 2025-12-04T09:05:04.2782936Z inflating: build/bin/c10_ordered_preserving_dict_test 2025-12-04T09:05:04.2912212Z inflating: build/bin/c10_small_vector_test 2025-12-04T09:05:04.2959401Z inflating: build/bin/c10_registry_test 2025-12-04T09:05:04.3009502Z inflating: build/bin/c10_string_util_test 2025-12-04T09:05:04.3055434Z inflating: build/bin/c10_ssize_test 2025-12-04T09:05:04.3099075Z inflating: build/bin/c10_string_view_test 2025-12-04T09:05:04.3137973Z inflating: build/bin/c10_intrusive_ptr_benchmark 2025-12-04T09:05:04.3182499Z inflating: build/bin/c10_tempfile_test 2025-12-04T09:05:04.3231732Z inflating: build/bin/c10_typeid_test 2025-12-04T09:05:04.3278673Z inflating: build/bin/c10_cuda_CUDAAssertionsTest_1_var_test 2025-12-04T09:05:04.3326001Z inflating: build/bin/c10_cuda_CUDAAssertionsTest_catches_stream 2025-12-04T09:05:04.3373063Z inflating: build/bin/c10_cuda_CUDAAssertionsTest_multiple_writes_from_blocks_and_threads 2025-12-04T09:05:04.3417202Z inflating: build/bin/c10_cuda_CUDATest 2025-12-04T09:05:04.3464390Z inflating: build/bin/c10_cuda_CUDAAssertionsTest_catches_thread_and_block_and_device 2025-12-04T09:05:04.3510708Z inflating: build/bin/c10_cuda_CUDAAssertionsTest_from_2_processes 2025-12-04T09:05:04.3557794Z inflating: build/bin/c10_cuda_CUDAAssertionsTest_multiple_writes_from_multiple_blocks 2025-12-04T09:05:04.3604546Z inflating: build/bin/c10_cuda_CUDAAssertionsTest_multiple_writes_from_same_block 2025-12-04T09:05:04.4087636Z inflating: build/bin/vec_test_all_types_DEFAULT 2025-12-04T09:05:04.4582011Z inflating: build/bin/vec_test_all_types_AVX512 2025-12-04T09:05:04.5085573Z inflating: build/bin/vec_test_all_types_AVX2 2025-12-04T09:05:04.5129473Z inflating: build/bin/test_vec_half_DEFAULT 2025-12-04T09:05:04.5213664Z inflating: build/bin/test_aoti_abi_check 2025-12-04T09:05:04.5258624Z inflating: build/bin/test_vec_half_AVX512 2025-12-04T09:05:04.5303990Z inflating: build/bin/test_vec_half_AVX2 2025-12-04T09:05:04.5367689Z inflating: build/bin/Dict_test 2025-12-04T09:05:04.5414426Z inflating: build/bin/Dimname_test 2025-12-04T09:05:04.5471765Z inflating: build/bin/MaybeOwned_test 2025-12-04T09:05:04.5521698Z inflating: build/bin/NamedTensor_test 2025-12-04T09:05:04.5573790Z inflating: build/bin/apply_utils_test 2025-12-04T09:05:04.5625660Z inflating: build/bin/atest 2025-12-04T09:05:04.5681741Z inflating: build/bin/basic 2025-12-04T09:05:04.5729809Z inflating: build/bin/broadcast_test 2025-12-04T09:05:04.5775166Z inflating: build/bin/cpu_allocator_test 2025-12-04T09:05:04.5826425Z inflating: build/bin/cpu_generator_test 2025-12-04T09:05:04.5873552Z inflating: build/bin/cpu_profiling_allocator_test 2025-12-04T09:05:04.5954132Z inflating: build/bin/cpu_rng_test 2025-12-04T09:05:04.6000273Z inflating: build/bin/dlconvertor_test 2025-12-04T09:05:04.6051100Z inflating: build/bin/extension_backend_test 2025-12-04T09:05:04.6100081Z inflating: build/bin/half_test 2025-12-04T09:05:04.6183622Z inflating: build/bin/ivalue_test 2025-12-04T09:05:04.6227902Z inflating: build/bin/lazy_tensor_test 2025-12-04T09:05:04.6274901Z inflating: build/bin/math_kernel_test 2025-12-04T09:05:04.6321508Z inflating: build/bin/memory_format_test 2025-12-04T09:05:04.6369128Z inflating: build/bin/memory_overlapping_test 2025-12-04T09:05:04.6416743Z inflating: build/bin/mobile_memory_cleanup 2025-12-04T09:05:04.6466822Z inflating: build/bin/native_test 2025-12-04T09:05:04.6512470Z inflating: build/bin/operator_name_test 2025-12-04T09:05:04.6557657Z inflating: build/bin/operators_test 2025-12-04T09:05:04.6604170Z inflating: build/bin/packedtensoraccessor_test 2025-12-04T09:05:04.6663218Z inflating: build/bin/pow_test 2025-12-04T09:05:04.6713163Z inflating: build/bin/quantized_test 2025-12-04T09:05:04.6757692Z inflating: build/bin/reduce_ops_test 2025-12-04T09:05:04.6803332Z inflating: build/bin/reportMemoryUsage_test 2025-12-04T09:05:04.6852501Z inflating: build/bin/scalar_tensor_test 2025-12-04T09:05:04.6903216Z inflating: build/bin/scalar_test 2025-12-04T09:05:04.6949072Z inflating: build/bin/StorageUtils_test 2025-12-04T09:05:04.6995224Z inflating: build/bin/stride_properties_test 2025-12-04T09:05:04.7064695Z inflating: build/bin/tensor_iterator_test 2025-12-04T09:05:04.7112883Z inflating: build/bin/test_parallel 2025-12-04T09:05:04.7157861Z inflating: build/bin/thread_init_test 2025-12-04T09:05:04.7206757Z inflating: build/bin/type_ptr_test 2025-12-04T09:05:04.7259145Z inflating: build/bin/type_test 2025-12-04T09:05:04.7306451Z inflating: build/bin/undefined_tensor_test 2025-12-04T09:05:04.7350591Z inflating: build/bin/verify_api_visibility 2025-12-04T09:05:04.7412569Z inflating: build/bin/legacy_vmap_test 2025-12-04T09:05:04.7458197Z inflating: build/bin/weakref_test 2025-12-04T09:05:04.7504225Z inflating: build/bin/wrapdim_test 2025-12-04T09:05:04.7549493Z inflating: build/bin/xla_tensor_test 2025-12-04T09:05:04.7601850Z inflating: build/bin/IListRef_test 2025-12-04T09:05:04.7692995Z inflating: build/bin/List_test 2025-12-04T09:05:04.7751344Z inflating: build/bin/KernelFunction_test 2025-12-04T09:05:04.7853207Z inflating: build/bin/kernel_function_legacy_test 2025-12-04T09:05:04.7935368Z inflating: build/bin/kernel_function_test 2025-12-04T09:05:04.8043136Z inflating: build/bin/kernel_lambda_legacy_test 2025-12-04T09:05:04.8130432Z inflating: build/bin/kernel_lambda_test 2025-12-04T09:05:04.8183459Z inflating: build/bin/kernel_stackbased_test 2025-12-04T09:05:04.8265475Z inflating: build/bin/make_boxed_from_unboxed_functor_test 2025-12-04T09:05:04.8311312Z inflating: build/bin/CppSignature_test 2025-12-04T09:05:04.8360034Z inflating: build/bin/backend_fallback_test 2025-12-04T09:05:04.8403738Z inflating: build/bin/op_allowlist_test 2025-12-04T09:05:04.8665148Z inflating: build/bin/op_registration_test 2025-12-04T09:05:04.8723493Z inflating: build/bin/inline_container_test 2025-12-04T09:05:04.8770944Z inflating: build/bin/cuda_allocator_test 2025-12-04T09:05:04.8817674Z inflating: build/bin/cuda_apply_test 2025-12-04T09:05:04.8871306Z inflating: build/bin/cuda_atomic_ops_test 2025-12-04T09:05:04.8921164Z inflating: build/bin/cuda_caching_host_allocator_test 2025-12-04T09:05:04.8983176Z inflating: build/bin/cuda_complex_math_test 2025-12-04T09:05:04.9035844Z inflating: build/bin/cuda_complex_test 2025-12-04T09:05:04.9092528Z inflating: build/bin/cuda_cub_test 2025-12-04T09:05:04.9139432Z inflating: build/bin/cuda_cublas_handle_pool_test 2025-12-04T09:05:04.9183676Z inflating: build/bin/cuda_device_test 2025-12-04T09:05:04.9249280Z inflating: build/bin/cuda_distributions_test 2025-12-04T09:05:04.9295414Z inflating: build/bin/cuda_dlconvertor_test 2025-12-04T09:05:04.9342713Z inflating: build/bin/cuda_event_test 2025-12-04T09:05:04.9386555Z inflating: build/bin/cuda_exchange_device_test 2025-12-04T09:05:04.9436768Z inflating: build/bin/cuda_generator_test 2025-12-04T09:05:04.9481769Z inflating: build/bin/cuda_half_test 2025-12-04T09:05:04.9526236Z inflating: build/bin/cuda_allocatorTraceTracker_test 2025-12-04T09:05:04.9580404Z inflating: build/bin/cuda_stream_test 2025-12-04T09:05:04.9627246Z inflating: build/bin/cuda_reportMemoryUsage_test 2025-12-04T09:05:04.9671549Z inflating: build/bin/cuda_cudnn_test 2025-12-04T09:05:04.9717395Z inflating: build/bin/cuda_integer_divider_test 2025-12-04T09:05:04.9761693Z inflating: build/bin/cuda_optional_test 2025-12-04T09:05:04.9808260Z inflating: build/bin/cuda_packedtensoraccessor_test 2025-12-04T09:05:04.9854806Z inflating: build/bin/cuda_vectorized_test 2025-12-04T09:05:05.0753177Z inflating: build/bin/test_jit 2025-12-04T09:05:05.1046414Z inflating: build/bin/test_lazy 2025-12-04T09:05:05.1093189Z inflating: build/bin/BackoffTest 2025-12-04T09:05:05.1140296Z inflating: build/bin/FileStoreTest 2025-12-04T09:05:05.1190678Z inflating: build/bin/TCPStoreTest 2025-12-04T09:05:05.1238145Z inflating: build/bin/HashStoreTest 2025-12-04T09:05:05.1250100Z inflating: build/bin/ProcessGroupMPITest 2025-12-04T09:05:05.1252884Z inflating: build/bin/example_allreduce 2025-12-04T09:05:05.1301898Z inflating: build/bin/test_dist_autograd 2025-12-04T09:05:05.1361765Z inflating: build/bin/test_cpp_rpc 2025-12-04T09:05:05.1420937Z inflating: build/bin/ProcessGroupGlooTest 2025-12-04T09:05:05.1471407Z inflating: build/bin/ProcessGroupGlooAsyncTest 2025-12-04T09:05:05.1527780Z inflating: build/bin/ProcessGroupNCCLTest 2025-12-04T09:05:05.1582112Z inflating: build/bin/ProcessGroupNCCLErrorsTest 2025-12-04T09:05:05.2539451Z inflating: build/bin/test_api 2025-12-04T09:05:05.2541984Z inflating: build/bin/parallel_benchmark 2025-12-04T09:05:05.2545350Z inflating: build/bin/torch_shm_manager 2025-12-04T09:05:05.2545794Z creating: .additional_ci_files/ 2025-12-04T09:05:05.2598383Z inflating: .additional_ci_files/test-times.json 2025-12-04T09:05:05.2789004Z inflating: .additional_ci_files/test-class-times.json 2025-12-04T09:05:05.2820486Z ##[group]Run rm artifacts.zip 2025-12-04T09:05:05.2820718Z rm artifacts.zip 2025-12-04T09:05:05.2832697Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:05:05.2832984Z env: 2025-12-04T09:05:05.2833325Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:05:05.2833537Z HAS_NVIDIA_GPU: true 2025-12-04T09:05:05.2833767Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:05:05.2834025Z ##[endgroup] 2025-12-04T09:05:05.3916392Z ##[group]Run df -H 2025-12-04T09:05:05.3916606Z df -H 2025-12-04T09:05:05.3923679Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:05:05.3924016Z env: 2025-12-04T09:05:05.3924196Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:05:05.3924426Z HAS_NVIDIA_GPU: true 2025-12-04T09:05:05.3924701Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:05:05.3925014Z ##[endgroup] 2025-12-04T09:05:05.3973777Z Filesystem Size Used Avail Use% Mounted on 2025-12-04T09:05:05.3974156Z devtmpfs 4.2M 0 4.2M 0% /dev 2025-12-04T09:05:05.3974468Z tmpfs 33G 0 33G 0% /dev/shm 2025-12-04T09:05:05.3974755Z tmpfs 13G 775k 13G 1% /run 2025-12-04T09:05:05.3975036Z /dev/nvme0n1p1 161G 54G 108G 34% / 2025-12-04T09:05:05.3975335Z tmpfs 33G 17k 33G 1% /tmp 2025-12-04T09:05:05.3975637Z /dev/nvme0n1p128 11M 1.4M 9.2M 13% /boot/efi 2025-12-04T09:05:05.3975943Z tmpfs 6.5G 0 6.5G 0% /run/user/0 2025-12-04T09:05:05.4007356Z Prepare all required actions 2025-12-04T09:05:05.4008170Z Getting action download info 2025-12-04T09:05:05.5776944Z ##[group]Run ./.github/actions/download-td-artifacts 2025-12-04T09:05:05.5777201Z with: 2025-12-04T09:05:05.5777342Z env: 2025-12-04T09:05:05.5777493Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:05:05.5777682Z HAS_NVIDIA_GPU: true 2025-12-04T09:05:05.5777911Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:05:05.5778173Z ##[endgroup] 2025-12-04T09:05:05.5880707Z ##[group]Run seemethere/download-artifact-s3@v4 2025-12-04T09:05:05.5880967Z with: 2025-12-04T09:05:05.5881115Z name: td_results 2025-12-04T09:05:05.5881289Z s3-bucket: gha-artifacts 2025-12-04T09:05:05.5881479Z region: us-east-1 2025-12-04T09:05:05.5881632Z env: 2025-12-04T09:05:05.5881781Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:05:05.5881967Z HAS_NVIDIA_GPU: true 2025-12-04T09:05:05.5882200Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:05:05.5882490Z ##[endgroup] 2025-12-04T09:05:05.9882037Z (node:58918) NOTE: We are formalizing our plans to enter AWS SDK for JavaScript (v2) into maintenance mode in 2023. 2025-12-04T09:05:05.9882609Z 2025-12-04T09:05:05.9882832Z Please migrate your code to use AWS SDK for JavaScript (v3). 2025-12-04T09:05:05.9883425Z For more information, check the migration guide at https://a.co/7PzMCcy 2025-12-04T09:05:05.9884017Z (Use `node --trace-warnings ...` to show where the warning was created) 2025-12-04T09:05:06.0883324Z Found 1 objects with prefix pytorch/pytorch/19922768520/td_results/ 2025-12-04T09:05:06.0884012Z Starting download (1/1): /home/ec2-user/actions-runner/_work/pytorch/pytorch/td_results.json 2025-12-04T09:05:06.1445375Z Finished download (1/1): /home/ec2-user/actions-runner/_work/pytorch/pytorch/td_results.json 2025-12-04T09:05:06.1450094Z Artifact download has finished successfully 2025-12-04T09:05:06.1695888Z ##[group]Run mkdir -p .additional_ci_files 2025-12-04T09:05:06.1696184Z mkdir -p .additional_ci_files 2025-12-04T09:05:06.1696487Z mv td_results.json .additional_ci_files/td_results.json || true 2025-12-04T09:05:06.1706016Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:05:06.1706285Z env: 2025-12-04T09:05:06.1706445Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:05:06.1706643Z HAS_NVIDIA_GPU: true 2025-12-04T09:05:06.1706874Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:05:06.1707159Z ##[endgroup] 2025-12-04T09:05:06.1824713Z ##[group]Run .github/scripts/parse_ref.py 2025-12-04T09:05:06.1825328Z .github/scripts/parse_ref.py 2025-12-04T09:05:06.1832188Z shell: /usr/bin/bash -e {0} 2025-12-04T09:05:06.1832395Z env: 2025-12-04T09:05:06.1832565Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:05:06.1832758Z HAS_NVIDIA_GPU: true 2025-12-04T09:05:06.1832979Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:05:06.1833258Z ##[endgroup] 2025-12-04T09:05:06.2029685Z Setting output branch=main 2025-12-04T09:05:06.2127262Z Prepare all required actions 2025-12-04T09:05:06.2127592Z Getting action download info 2025-12-04T09:05:06.3553977Z ##[group]Run ./.github/actions/filter-test-configs 2025-12-04T09:05:06.3554234Z with: 2025-12-04T09:05:06.3554569Z github-token: *** 2025-12-04T09:05:06.3561371Z test-matrix: {"include": [{"config": "default", "shard": 1, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 1, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 2, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 2, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 3, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 3, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 4, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 4, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 5, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 5, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "distributed", "shard": 1, "num_shards": 3, "runner": "lf.linux.g4dn.12xlarge.nvidia.gpu", "mem_leak_check": "mem_leak_check"}, {"config": "distributed", "shard": 1, "num_shards": 3, "runner": "lf.linux.g4dn.12xlarge.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "distributed", "shard": 2, "num_shards": 3, "runner": "lf.linux.g4dn.12xlarge.nvidia.gpu", "mem_leak_check": "mem_leak_check"}, {"config": "distributed", "shard": 2, "num_shards": 3, "runner": "lf.linux.g4dn.12xlarge.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "distributed", "shard": 3, "num_shards": 3, "runner": "lf.linux.g4dn.12xlarge.nvidia.gpu", "mem_leak_check": "mem_leak_check"}, {"config": "distributed", "shard": 3, "num_shards": 3, "runner": "lf.linux.g4dn.12xlarge.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "pr_time_benchmarks", "shard": 1, "num_shards": 1, "runner": "linux.g4dn.metal.nvidia.gpu", "mem_leak_check": "mem_leak_check"}, {"config": "pr_time_benchmarks", "shard": 1, "num_shards": 1, "runner": "linux.g4dn.metal.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "libtorch_agnostic_targetting", "shard": 1, "num_shards": 1, "runner": "linux.g4dn.metal.nvidia.gpu", "mem_leak_check": "mem_leak_check"}, {"config": "libtorch_agnostic_targetting", "shard": 1, "num_shards": 1, "runner": "linux.g4dn.metal.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests"}]} 2025-12-04T09:05:06.3569064Z job-name: linux-jammy-cuda12.8-py3.10-gcc11 / test (default, 1, 5, lf.linux.g6.4xlarge.experimental.nvidia.gpu, rerun_disabled_tests) 2025-12-04T09:05:06.3569535Z env: 2025-12-04T09:05:06.3569697Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:05:06.3569887Z HAS_NVIDIA_GPU: true 2025-12-04T09:05:06.3570114Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:05:06.3570365Z ##[endgroup] 2025-12-04T09:05:06.3659995Z ##[group]Run nick-fields/retry@v3.0.0 2025-12-04T09:05:06.3660216Z with: 2025-12-04T09:05:06.3660369Z shell: bash 2025-12-04T09:05:06.3660522Z timeout_minutes: 10 2025-12-04T09:05:06.3660693Z max_attempts: 5 2025-12-04T09:05:06.3660864Z retry_wait_seconds: 30 2025-12-04T09:05:06.3661620Z command: set -eux # PyYAML 6.0 doesn't work with MacOS x86 anymore # This must run on Python-3.7 (AmazonLinux2) so can't use request=3.32.2 python3 -m pip install requests==2.27.1 pyyaml==6.0.2 2025-12-04T09:05:06.3662469Z polling_interval_seconds: 1 2025-12-04T09:05:06.3662684Z warning_on_retry: true 2025-12-04T09:05:06.3662876Z continue_on_error: false 2025-12-04T09:05:06.3663054Z env: 2025-12-04T09:05:06.3663206Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:05:06.3663397Z HAS_NVIDIA_GPU: true 2025-12-04T09:05:06.3663610Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:05:06.3663998Z GITHUB_TOKEN: *** 2025-12-04T09:05:06.3664174Z ##[endgroup] 2025-12-04T09:05:06.4566850Z + python3 -m pip install requests==2.27.1 pyyaml==6.0.2 2025-12-04T09:05:06.6645996Z Defaulting to user installation because normal site-packages is not writeable 2025-12-04T09:05:06.8188359Z Collecting requests==2.27.1 2025-12-04T09:05:06.8378780Z Downloading requests-2.27.1-py2.py3-none-any.whl (63 kB) 2025-12-04T09:05:07.0469387Z Collecting pyyaml==6.0.2 2025-12-04T09:05:07.0508312Z Downloading PyYAML-6.0.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (737 kB) 2025-12-04T09:05:07.0759045Z Requirement already satisfied: idna<4,>=2.5 in /usr/lib/python3.9/site-packages (from requests==2.27.1) (2.10) 2025-12-04T09:05:07.1176692Z Collecting certifi>=2017.4.17 2025-12-04T09:05:07.1217867Z Downloading certifi-2025.11.12-py3-none-any.whl (159 kB) 2025-12-04T09:05:07.4749846Z Collecting charset-normalizer~=2.0.0 2025-12-04T09:05:07.4787007Z Downloading charset_normalizer-2.0.12-py3-none-any.whl (39 kB) 2025-12-04T09:05:07.4827679Z Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/lib/python3.9/site-packages (from requests==2.27.1) (1.25.10) 2025-12-04T09:05:07.5545302Z Installing collected packages: charset-normalizer, certifi, requests, pyyaml 2025-12-04T09:05:07.6596107Z Successfully installed certifi-2025.11.12 charset-normalizer-2.0.12 pyyaml-6.0.2 requests-2.27.1 2025-12-04T09:05:08.4365333Z Command completed after 1 attempt(s). 2025-12-04T09:05:08.4434693Z ##[group]Run set -x 2025-12-04T09:05:08.4434893Z set -x 2025-12-04T09:05:08.4435057Z  2025-12-04T09:05:08.4435335Z # Use relative path here as this could be checked out anywhere, not necessarily 2025-12-04T09:05:08.4435688Z # in runner workspace 2025-12-04T09:05:08.4435977Z python3 "${GITHUB_ACTION_PATH}/../../scripts/parse_ref.py" 2025-12-04T09:05:08.4444958Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:05:08.4445239Z env: 2025-12-04T09:05:08.4445401Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:05:08.4445599Z HAS_NVIDIA_GPU: true 2025-12-04T09:05:08.4445841Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:05:08.4446108Z ##[endgroup] 2025-12-04T09:05:08.4473995Z + python3 /home/ec2-user/actions-runner/_work/pytorch/pytorch/./.github/actions/filter-test-configs/../../scripts/parse_ref.py 2025-12-04T09:05:08.4641051Z Setting output branch=main 2025-12-04T09:05:08.4693310Z ##[group]Run echo "Workflow: ${GITHUB_WORKFLOW}" 2025-12-04T09:05:08.4693679Z echo "Workflow: ${GITHUB_WORKFLOW}" 2025-12-04T09:05:08.4693933Z echo "Job name: ${JOB_NAME}" 2025-12-04T09:05:08.4694148Z  2025-12-04T09:05:08.4694433Z # Use relative path here as this could be checked out anywhere, not necessarily 2025-12-04T09:05:08.4694796Z # in runner workspace 2025-12-04T09:05:08.4695144Z python3 "${GITHUB_ACTION_PATH}/../../scripts/filter_test_configs.py" \ 2025-12-04T09:05:08.4695508Z  --workflow "${GITHUB_WORKFLOW}" \ 2025-12-04T09:05:08.4695759Z  --job-name "${JOB_NAME}" \ 2025-12-04T09:05:08.4703309Z  --test-matrix "{"include": [{"config": "default", "shard": 1, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 1, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 2, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 2, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 3, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 3, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 4, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 4, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 5, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 5, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "distributed", "shard": 1, "num_shards": 3, "runner": "lf.linux.g4dn.12xlarge.nvidia.gpu", "mem_leak_check": "mem_leak_check"}, {"config": "distributed", "shard": 1, "num_shards": 3, "runner": "lf.linux.g4dn.12xlarge.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "distributed", "shard": 2, "num_shards": 3, "runner": "lf.linux.g4dn.12xlarge.nvidia.gpu", "mem_leak_check": "mem_leak_check"}, {"config": "distributed", "shard": 2, "num_shards": 3, "runner": "lf.linux.g4dn.12xlarge.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "distributed", "shard": 3, "num_shards": 3, "runner": "lf.linux.g4dn.12xlarge.nvidia.gpu", "mem_leak_check": "mem_leak_check"}, {"config": "distributed", "shard": 3, "num_shards": 3, "runner": "lf.linux.g4dn.12xlarge.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "pr_time_benchmarks", "shard": 1, "num_shards": 1, "runner": "linux.g4dn.metal.nvidia.gpu", "mem_leak_check": "mem_leak_check"}, {"config": "pr_time_benchmarks", "shard": 1, "num_shards": 1, "runner": "linux.g4dn.metal.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "libtorch_agnostic_targetting", "shard": 1, "num_shards": 1, "runner": "linux.g4dn.metal.nvidia.gpu", "mem_leak_check": "mem_leak_check"}, {"config": "libtorch_agnostic_targetting", "shard": 1, "num_shards": 1, "runner": "linux.g4dn.metal.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests"}]}" \ 2025-12-04T09:05:08.4710810Z  --selected-test-configs "" \ 2025-12-04T09:05:08.4711055Z  --pr-number "${PR_NUMBER}" \ 2025-12-04T09:05:08.4711273Z  --tag "${TAG}" \ 2025-12-04T09:05:08.4711476Z  --event-name "${EVENT_NAME}" \ 2025-12-04T09:05:08.4711703Z  --schedule "${SCHEDULE}" \ 2025-12-04T09:05:08.4711919Z  --branch "${HEAD_BRANCH}" 2025-12-04T09:05:08.4718930Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:05:08.4719216Z env: 2025-12-04T09:05:08.4719377Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:05:08.4719573Z HAS_NVIDIA_GPU: true 2025-12-04T09:05:08.4719806Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:05:08.4720257Z GITHUB_TOKEN: *** 2025-12-04T09:05:08.4720733Z JOB_NAME: linux-jammy-cuda12.8-py3.10-gcc11 / test (default, 1, 5, lf.linux.g6.4xlarge.experimental.nvidia.gpu, rerun_disabled_tests) 2025-12-04T09:05:08.4721205Z PR_NUMBER: 2025-12-04T09:05:08.4721368Z TAG: 2025-12-04T09:05:08.4721521Z EVENT_NAME: schedule 2025-12-04T09:05:08.4721701Z SCHEDULE: 29 8 * * * 2025-12-04T09:05:08.4721862Z HEAD_BRANCH: main 2025-12-04T09:05:08.4722028Z ##[endgroup] 2025-12-04T09:05:08.4747907Z Workflow: trunk 2025-12-04T09:05:08.4748831Z Job name: linux-jammy-cuda12.8-py3.10-gcc11 / test (default, 1, 5, lf.linux.g6.4xlarge.experimental.nvidia.gpu, rerun_disabled_tests) 2025-12-04T09:05:08.6558545Z Setting output keep-going=True 2025-12-04T09:05:08.6558897Z Setting output ci-verbose-test-logs=False 2025-12-04T09:05:08.6559263Z Setting output ci-test-showlocals=False 2025-12-04T09:05:08.6559590Z Setting output ci-no-test-timeout=False 2025-12-04T09:05:08.6559909Z Setting output ci-no-td=False 2025-12-04T09:05:08.6560215Z Setting output ci-td-distributed=False 2025-12-04T09:05:08.6560557Z Setting output is-unstable=False 2025-12-04T09:05:08.6560870Z Setting output reenabled-issues= 2025-12-04T09:05:08.6580068Z Setting output test-matrix={"include": [{"config": "default", "shard": 1, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 1, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "mem_leak_check": "mem_leak_check", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 1, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 1, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 2, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 2, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "mem_leak_check": "mem_leak_check", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 2, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 2, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 3, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 3, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "mem_leak_check": "mem_leak_check", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 3, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 3, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 4, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 4, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "mem_leak_check": "mem_leak_check", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 4, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 4, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 5, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 5, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "mem_leak_check": "mem_leak_check", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 5, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 5, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "distributed", "shard": 1, "num_shards": 3, "runner": "lf.linux.g4dn.12xlarge.nvidia.gpu", "mem_leak_check": "mem_leak_check"}, {"config": "distributed", "shard": 1, "num_shards": 3, "runner": "lf.linux.g4dn.12xlarge.nvidia.gpu", "mem_leak_check": "mem_leak_check", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "distributed", "shard": 1, "num_shards": 3, "runner": "lf.linux.g4dn.12xlarge.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests", "mem_leak_check": "mem_leak_check"}, {"config": "distributed", "shard": 1, "num_shards": 3, "runner": "lf.linux.g4dn.12xlarge.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "distributed", "shard": 2, "num_shards": 3, "runner": "lf.linux.g4dn.12xlarge.nvidia.gpu", "mem_leak_check": "mem_leak_check"}, {"config": "distributed", "shard": 2, "num_shards": 3, "runner": "lf.linux.g4dn.12xlarge.nvidia.gpu", "mem_leak_check": "mem_leak_check", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "distributed", "shard": 2, "num_shards": 3, "runner": "lf.linux.g4dn.12xlarge.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests", "mem_leak_check": "mem_leak_check"}, {"config": "distributed", "shard": 2, "num_shards": 3, "runner": "lf.linux.g4dn.12xlarge.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "distributed", "shard": 3, "num_shards": 3, "runner": "lf.linux.g4dn.12xlarge.nvidia.gpu", "mem_leak_check": "mem_leak_check"}, {"config": "distributed", "shard": 3, "num_shards": 3, "runner": "lf.linux.g4dn.12xlarge.nvidia.gpu", "mem_leak_check": "mem_leak_check", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "distributed", "shard": 3, "num_shards": 3, "runner": "lf.linux.g4dn.12xlarge.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests", "mem_leak_check": "mem_leak_check"}, {"config": "distributed", "shard": 3, "num_shards": 3, "runner": "lf.linux.g4dn.12xlarge.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "pr_time_benchmarks", "shard": 1, "num_shards": 1, "runner": "linux.g4dn.metal.nvidia.gpu", "mem_leak_check": "mem_leak_check"}, {"config": "pr_time_benchmarks", "shard": 1, "num_shards": 1, "runner": "linux.g4dn.metal.nvidia.gpu", "mem_leak_check": "mem_leak_check", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "pr_time_benchmarks", "shard": 1, "num_shards": 1, "runner": "linux.g4dn.metal.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests", "mem_leak_check": "mem_leak_check"}, {"config": "pr_time_benchmarks", "shard": 1, "num_shards": 1, "runner": "linux.g4dn.metal.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "libtorch_agnostic_targetting", "shard": 1, "num_shards": 1, "runner": "linux.g4dn.metal.nvidia.gpu", "mem_leak_check": "mem_leak_check"}, {"config": "libtorch_agnostic_targetting", "shard": 1, "num_shards": 1, "runner": "linux.g4dn.metal.nvidia.gpu", "mem_leak_check": "mem_leak_check", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "libtorch_agnostic_targetting", "shard": 1, "num_shards": 1, "runner": "linux.g4dn.metal.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests", "mem_leak_check": "mem_leak_check"}, {"config": "libtorch_agnostic_targetting", "shard": 1, "num_shards": 1, "runner": "linux.g4dn.metal.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests"}]} 2025-12-04T09:05:08.6595763Z Setting output is-test-matrix-empty=False 2025-12-04T09:05:08.6666423Z ##[group]Run echo "Filtered matrix:" 2025-12-04T09:05:08.6666736Z echo "Filtered matrix:" 2025-12-04T09:05:08.6682754Z echo "{"include": [{"config": "default", "shard": 1, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 1, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "mem_leak_check": "mem_leak_check", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 1, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 1, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 2, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 2, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "mem_leak_check": "mem_leak_check", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 2, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 2, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 3, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 3, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "mem_leak_check": "mem_leak_check", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 3, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 3, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 4, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 4, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "mem_leak_check": "mem_leak_check", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 4, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 4, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 5, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 5, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "mem_leak_check": "mem_leak_check", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 5, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 5, "num_shards": 5, "runner": "lf.linux.g6.4xlarge.experimental.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "distributed", "shard": 1, "num_shards": 3, "runner": "lf.linux.g4dn.12xlarge.nvidia.gpu", "mem_leak_check": "mem_leak_check"}, {"config": "distributed", "shard": 1, "num_shards": 3, "runner": "lf.linux.g4dn.12xlarge.nvidia.gpu", "mem_leak_check": "mem_leak_check", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "distributed", "shard": 1, "num_shards": 3, "runner": "lf.linux.g4dn.12xlarge.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests", "mem_leak_check": "mem_leak_check"}, {"config": "distributed", "shard": 1, "num_shards": 3, "runner": "lf.linux.g4dn.12xlarge.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "distributed", "shard": 2, "num_shards": 3, "runner": "lf.linux.g4dn.12xlarge.nvidia.gpu", "mem_leak_check": "mem_leak_check"}, {"config": "distributed", "shard": 2, "num_shards": 3, "runner": "lf.linux.g4dn.12xlarge.nvidia.gpu", "mem_leak_check": "mem_leak_check", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "distributed", "shard": 2, "num_shards": 3, "runner": "lf.linux.g4dn.12xlarge.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests", "mem_leak_check": "mem_leak_check"}, {"config": "distributed", "shard": 2, "num_shards": 3, "runner": "lf.linux.g4dn.12xlarge.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "distributed", "shard": 3, "num_shards": 3, "runner": "lf.linux.g4dn.12xlarge.nvidia.gpu", "mem_leak_check": "mem_leak_check"}, {"config": "distributed", "shard": 3, "num_shards": 3, "runner": "lf.linux.g4dn.12xlarge.nvidia.gpu", "mem_leak_check": "mem_leak_check", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "distributed", "shard": 3, "num_shards": 3, "runner": "lf.linux.g4dn.12xlarge.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests", "mem_leak_check": "mem_leak_check"}, {"config": "distributed", "shard": 3, "num_shards": 3, "runner": "lf.linux.g4dn.12xlarge.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "pr_time_benchmarks", "shard": 1, "num_shards": 1, "runner": "linux.g4dn.metal.nvidia.gpu", "mem_leak_check": "mem_leak_check"}, {"config": "pr_time_benchmarks", "shard": 1, "num_shards": 1, "runner": "linux.g4dn.metal.nvidia.gpu", "mem_leak_check": "mem_leak_check", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "pr_time_benchmarks", "shard": 1, "num_shards": 1, "runner": "linux.g4dn.metal.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests", "mem_leak_check": "mem_leak_check"}, {"config": "pr_time_benchmarks", "shard": 1, "num_shards": 1, "runner": "linux.g4dn.metal.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "libtorch_agnostic_targetting", "shard": 1, "num_shards": 1, "runner": "linux.g4dn.metal.nvidia.gpu", "mem_leak_check": "mem_leak_check"}, {"config": "libtorch_agnostic_targetting", "shard": 1, "num_shards": 1, "runner": "linux.g4dn.metal.nvidia.gpu", "mem_leak_check": "mem_leak_check", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "libtorch_agnostic_targetting", "shard": 1, "num_shards": 1, "runner": "linux.g4dn.metal.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests", "mem_leak_check": "mem_leak_check"}, {"config": "libtorch_agnostic_targetting", "shard": 1, "num_shards": 1, "runner": "linux.g4dn.metal.nvidia.gpu", "rerun_disabled_tests": "rerun_disabled_tests"}]}" 2025-12-04T09:05:08.6698480Z  2025-12-04T09:05:08.6698628Z echo 2025-12-04T09:05:08.6698833Z echo "Is the current job unstable? False" 2025-12-04T09:05:08.6699064Z  2025-12-04T09:05:08.6699209Z echo 2025-12-04T09:05:08.6699481Z echo "Is keep-going label set? True" 2025-12-04T09:05:08.6699698Z  2025-12-04T09:05:08.6699843Z echo 2025-12-04T09:05:08.6700018Z echo "Reenabled issues? " 2025-12-04T09:05:08.6707236Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:05:08.6707515Z env: 2025-12-04T09:05:08.6707678Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:05:08.6707864Z HAS_NVIDIA_GPU: true 2025-12-04T09:05:08.6708094Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:05:08.6708370Z ##[endgroup] 2025-12-04T09:05:08.6732529Z Filtered matrix: 2025-12-04T09:05:08.6758057Z {include: [{config: default, shard: 1, num_shards: 5, runner: lf.linux.g6.4xlarge.experimental.nvidia.gpu, mem_leak_check: mem_leak_check}, {config: default, shard: 1, num_shards: 5, runner: lf.linux.g6.4xlarge.experimental.nvidia.gpu, mem_leak_check: mem_leak_check, rerun_disabled_tests: rerun_disabled_tests}, {config: default, shard: 1, num_shards: 5, runner: lf.linux.g6.4xlarge.experimental.nvidia.gpu, rerun_disabled_tests: rerun_disabled_tests, mem_leak_check: mem_leak_check}, {config: default, shard: 1, num_shards: 5, runner: lf.linux.g6.4xlarge.experimental.nvidia.gpu, rerun_disabled_tests: rerun_disabled_tests}, {config: default, shard: 2, num_shards: 5, runner: lf.linux.g6.4xlarge.experimental.nvidia.gpu, mem_leak_check: mem_leak_check}, {config: default, shard: 2, num_shards: 5, runner: lf.linux.g6.4xlarge.experimental.nvidia.gpu, mem_leak_check: mem_leak_check, rerun_disabled_tests: rerun_disabled_tests}, {config: default, shard: 2, num_shards: 5, runner: lf.linux.g6.4xlarge.experimental.nvidia.gpu, rerun_disabled_tests: rerun_disabled_tests, mem_leak_check: mem_leak_check}, {config: default, shard: 2, num_shards: 5, runner: lf.linux.g6.4xlarge.experimental.nvidia.gpu, rerun_disabled_tests: rerun_disabled_tests}, {config: default, shard: 3, num_shards: 5, runner: lf.linux.g6.4xlarge.experimental.nvidia.gpu, mem_leak_check: mem_leak_check}, {config: default, shard: 3, num_shards: 5, runner: lf.linux.g6.4xlarge.experimental.nvidia.gpu, mem_leak_check: mem_leak_check, rerun_disabled_tests: rerun_disabled_tests}, {config: default, shard: 3, num_shards: 5, runner: lf.linux.g6.4xlarge.experimental.nvidia.gpu, rerun_disabled_tests: rerun_disabled_tests, mem_leak_check: mem_leak_check}, {config: default, shard: 3, num_shards: 5, runner: lf.linux.g6.4xlarge.experimental.nvidia.gpu, rerun_disabled_tests: rerun_disabled_tests}, {config: default, shard: 4, num_shards: 5, runner: lf.linux.g6.4xlarge.experimental.nvidia.gpu, mem_leak_check: mem_leak_check}, {config: default, shard: 4, num_shards: 5, runner: lf.linux.g6.4xlarge.experimental.nvidia.gpu, mem_leak_check: mem_leak_check, rerun_disabled_tests: rerun_disabled_tests}, {config: default, shard: 4, num_shards: 5, runner: lf.linux.g6.4xlarge.experimental.nvidia.gpu, rerun_disabled_tests: rerun_disabled_tests, mem_leak_check: mem_leak_check}, {config: default, shard: 4, num_shards: 5, runner: lf.linux.g6.4xlarge.experimental.nvidia.gpu, rerun_disabled_tests: rerun_disabled_tests}, {config: default, shard: 5, num_shards: 5, runner: lf.linux.g6.4xlarge.experimental.nvidia.gpu, mem_leak_check: mem_leak_check}, {config: default, shard: 5, num_shards: 5, runner: lf.linux.g6.4xlarge.experimental.nvidia.gpu, mem_leak_check: mem_leak_check, rerun_disabled_tests: rerun_disabled_tests}, {config: default, shard: 5, num_shards: 5, runner: lf.linux.g6.4xlarge.experimental.nvidia.gpu, rerun_disabled_tests: rerun_disabled_tests, mem_leak_check: mem_leak_check}, {config: default, shard: 5, num_shards: 5, runner: lf.linux.g6.4xlarge.experimental.nvidia.gpu, rerun_disabled_tests: rerun_disabled_tests}, {config: distributed, shard: 1, num_shards: 3, runner: lf.linux.g4dn.12xlarge.nvidia.gpu, mem_leak_check: mem_leak_check}, {config: distributed, shard: 1, num_shards: 3, runner: lf.linux.g4dn.12xlarge.nvidia.gpu, mem_leak_check: mem_leak_check, rerun_disabled_tests: rerun_disabled_tests}, {config: distributed, shard: 1, num_shards: 3, runner: lf.linux.g4dn.12xlarge.nvidia.gpu, rerun_disabled_tests: rerun_disabled_tests, mem_leak_check: mem_leak_check}, {config: distributed, shard: 1, num_shards: 3, runner: lf.linux.g4dn.12xlarge.nvidia.gpu, rerun_disabled_tests: rerun_disabled_tests}, {config: distributed, shard: 2, num_shards: 3, runner: lf.linux.g4dn.12xlarge.nvidia.gpu, mem_leak_check: mem_leak_check}, {config: distributed, shard: 2, num_shards: 3, runner: lf.linux.g4dn.12xlarge.nvidia.gpu, mem_leak_check: mem_leak_check, rerun_disabled_tests: rerun_disabled_tests}, {config: distributed, shard: 2, num_shards: 3, runner: lf.linux.g4dn.12xlarge.nvidia.gpu, rerun_disabled_tests: rerun_disabled_tests, mem_leak_check: mem_leak_check}, {config: distributed, shard: 2, num_shards: 3, runner: lf.linux.g4dn.12xlarge.nvidia.gpu, rerun_disabled_tests: rerun_disabled_tests}, {config: distributed, shard: 3, num_shards: 3, runner: lf.linux.g4dn.12xlarge.nvidia.gpu, mem_leak_check: mem_leak_check}, {config: distributed, shard: 3, num_shards: 3, runner: lf.linux.g4dn.12xlarge.nvidia.gpu, mem_leak_check: mem_leak_check, rerun_disabled_tests: rerun_disabled_tests}, {config: distributed, shard: 3, num_shards: 3, runner: lf.linux.g4dn.12xlarge.nvidia.gpu, rerun_disabled_tests: rerun_disabled_tests, mem_leak_check: mem_leak_check}, {config: distributed, shard: 3, num_shards: 3, runner: lf.linux.g4dn.12xlarge.nvidia.gpu, rerun_disabled_tests: rerun_disabled_tests}, {config: pr_time_benchmarks, shard: 1, num_shards: 1, runner: linux.g4dn.metal.nvidia.gpu, mem_leak_check: mem_leak_check}, {config: pr_time_benchmarks, shard: 1, num_shards: 1, runner: linux.g4dn.metal.nvidia.gpu, mem_leak_check: mem_leak_check, rerun_disabled_tests: rerun_disabled_tests}, {config: pr_time_benchmarks, shard: 1, num_shards: 1, runner: linux.g4dn.metal.nvidia.gpu, rerun_disabled_tests: rerun_disabled_tests, mem_leak_check: mem_leak_check}, {config: pr_time_benchmarks, shard: 1, num_shards: 1, runner: linux.g4dn.metal.nvidia.gpu, rerun_disabled_tests: rerun_disabled_tests}, {config: libtorch_agnostic_targetting, shard: 1, num_shards: 1, runner: linux.g4dn.metal.nvidia.gpu, mem_leak_check: mem_leak_check}, {config: libtorch_agnostic_targetting, shard: 1, num_shards: 1, runner: linux.g4dn.metal.nvidia.gpu, mem_leak_check: mem_leak_check, rerun_disabled_tests: rerun_disabled_tests}, {config: libtorch_agnostic_targetting, shard: 1, num_shards: 1, runner: linux.g4dn.metal.nvidia.gpu, rerun_disabled_tests: rerun_disabled_tests, mem_leak_check: mem_leak_check}, {config: libtorch_agnostic_targetting, shard: 1, num_shards: 1, runner: linux.g4dn.metal.nvidia.gpu, rerun_disabled_tests: rerun_disabled_tests}]} 2025-12-04T09:05:08.6773589Z 2025-12-04T09:05:08.6773672Z Is the current job unstable? False 2025-12-04T09:05:08.6773819Z 2025-12-04T09:05:08.6773902Z Is keep-going label set? True 2025-12-04T09:05:08.6774046Z 2025-12-04T09:05:08.6774113Z Reenabled issues? 2025-12-04T09:05:08.6800901Z ##[group]Run echo "timeout=$((JOB_TIMEOUT-30))" >> "${GITHUB_OUTPUT}" 2025-12-04T09:05:08.6801319Z echo "timeout=$((JOB_TIMEOUT-30))" >> "${GITHUB_OUTPUT}" 2025-12-04T09:05:08.6808164Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:05:08.6808435Z env: 2025-12-04T09:05:08.6808598Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:05:08.6808795Z HAS_NVIDIA_GPU: true 2025-12-04T09:05:08.6809029Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:05:08.6809289Z JOB_TIMEOUT: 360 2025-12-04T09:05:08.6809455Z ##[endgroup] 2025-12-04T09:05:08.6855986Z ##[group]Run env | grep '^GITHUB' >> "/tmp/github_env_${GITHUB_RUN_ID}" 2025-12-04T09:05:08.6856428Z env | grep '^GITHUB' >> "/tmp/github_env_${GITHUB_RUN_ID}" 2025-12-04T09:05:08.6856774Z env | grep '^CI' >> "/tmp/github_env_${GITHUB_RUN_ID}" 2025-12-04T09:05:08.6863782Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:05:08.6864054Z env: 2025-12-04T09:05:08.6864212Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:05:08.6864404Z HAS_NVIDIA_GPU: true 2025-12-04T09:05:08.6864633Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:05:08.6865023Z ##[endgroup] 2025-12-04T09:05:08.6963093Z ##[group]Run set -x 2025-12-04T09:05:08.6963370Z set -x 2025-12-04T09:05:08.6963542Z  2025-12-04T09:05:08.6963729Z if [[ $TEST_CONFIG == 'multigpu' ]]; then 2025-12-04T09:05:08.6964007Z  TEST_COMMAND=.ci/pytorch/multigpu-test.sh 2025-12-04T09:05:08.6964292Z elif [[ $BUILD_ENVIRONMENT == *onnx* ]]; then 2025-12-04T09:05:08.6964549Z  TEST_COMMAND=.ci/onnx/test.sh 2025-12-04T09:05:08.6964762Z else 2025-12-04T09:05:08.6964945Z  TEST_COMMAND=.ci/pytorch/test.sh 2025-12-04T09:05:08.6965159Z fi 2025-12-04T09:05:08.6965318Z  2025-12-04T09:05:08.6965499Z # Leaving 1GB for the runner and other things 2025-12-04T09:05:08.6965919Z TOTAL_AVAILABLE_MEMORY_IN_GB=$(awk '/MemTotal/ { printf "%.3f \n", $2/1024/1024 - 1 }' /proc/meminfo) 2025-12-04T09:05:08.6966568Z # https://docs.docker.com/engine/containers/resource_constraints/#--memory-swap-details, the 3GB swap 2025-12-04T09:05:08.6967076Z # comes from https://github.com/pytorch/test-infra/pull/6058 2025-12-04T09:05:08.6967468Z TOTAL_MEMORY_WITH_SWAP=$(("${TOTAL_AVAILABLE_MEMORY_IN_GB%.*}" + 3)) 2025-12-04T09:05:08.6967763Z  2025-12-04T09:05:08.6967947Z if [[ ${BUILD_ENVIRONMENT} == *"s390x"* ]]; then 2025-12-04T09:05:08.6968197Z  SHM_OPTS= 2025-12-04T09:05:08.6968375Z  JENKINS_USER= 2025-12-04T09:05:08.6968625Z  # ensure that docker container cleanly exits in 12 hours 2025-12-04T09:05:08.6968965Z  # if for some reason cleanup action doesn't stop container 2025-12-04T09:05:08.6969241Z  # when job is cancelled 2025-12-04T09:05:08.6969459Z  DOCKER_SHELL_CMD="sleep 12h" 2025-12-04T09:05:08.6969687Z  USED_IMAGE="${DOCKER_IMAGE_S390X}" 2025-12-04T09:05:08.6969899Z else 2025-12-04T09:05:08.6970077Z  SHM_OPTS="--shm-size=${SHM_SIZE}" 2025-12-04T09:05:08.6970303Z  JENKINS_USER="--user jenkins" 2025-12-04T09:05:08.6970520Z  DOCKER_SHELL_CMD= 2025-12-04T09:05:08.6970734Z  USED_IMAGE="${DOCKER_IMAGE}" 2025-12-04T09:05:08.6970933Z fi 2025-12-04T09:05:08.6971069Z  2025-12-04T09:05:08.6971303Z # detached container should get cleaned up by teardown_ec2_linux 2025-12-04T09:05:08.6971670Z # TODO: Stop building test binaries as part of the build phase 2025-12-04T09:05:08.6972080Z # Used for GPU_FLAG, SHM_OPTS, JENKINS_USER and DOCKER_SHELL_CMD since that doesn't play nice 2025-12-04T09:05:08.6972534Z # shellcheck disable=SC2086,SC2090 2025-12-04T09:05:08.6972767Z container_name=$(docker run \ 2025-12-04T09:05:08.6972988Z  ${GPU_FLAG:-} \ 2025-12-04T09:05:08.6973196Z  ${SCCACHE_SERVER_PORT_DOCKER_FLAG:-} \ 2025-12-04T09:05:08.6973440Z  -e BUILD_ENVIRONMENT \ 2025-12-04T09:05:08.6973652Z  -e PR_NUMBER \ 2025-12-04T09:05:08.6973840Z  -e GITHUB_ACTIONS \ 2025-12-04T09:05:08.6974044Z  -e GITHUB_REPOSITORY \ 2025-12-04T09:05:08.6974265Z  -e GITHUB_WORKFLOW \ 2025-12-04T09:05:08.6974459Z  -e GITHUB_JOB \ 2025-12-04T09:05:08.6974645Z  -e GITHUB_RUN_ID \ 2025-12-04T09:05:08.6974847Z  -e GITHUB_RUN_NUMBER \ 2025-12-04T09:05:08.6975065Z  -e GITHUB_RUN_ATTEMPT \ 2025-12-04T09:05:08.6975268Z  -e JOB_ID \ 2025-12-04T09:05:08.6975451Z  -e JOB_NAME \ 2025-12-04T09:05:08.6975636Z  -e BASE_SHA \ 2025-12-04T09:05:08.6975806Z  -e BRANCH \ 2025-12-04T09:05:08.6975980Z  -e SHA1 \ 2025-12-04T09:05:08.6976160Z  -e AWS_DEFAULT_REGION \ 2025-12-04T09:05:08.6976365Z  -e IN_WHEEL_TEST \ 2025-12-04T09:05:08.6976563Z  -e SHARD_NUMBER \ 2025-12-04T09:05:08.6976759Z  -e TEST_CONFIG \ 2025-12-04T09:05:08.6977108Z  -e NUM_TEST_SHARDS \ 2025-12-04T09:05:08.6977420Z  -e REENABLED_ISSUES \ 2025-12-04T09:05:08.6977664Z  -e CONTINUE_THROUGH_ERROR \ 2025-12-04T09:05:08.6977896Z  -e VERBOSE_TEST_LOGS \ 2025-12-04T09:05:08.6978098Z  -e TEST_SHOWLOCALS \ 2025-12-04T09:05:08.6978298Z  -e NO_TEST_TIMEOUT \ 2025-12-04T09:05:08.6978485Z  -e NO_TD \ 2025-12-04T09:05:08.6978673Z  -e TD_DISTRIBUTED \ 2025-12-04T09:05:08.6978866Z  -e PR_LABELS \ 2025-12-04T09:05:08.6979077Z  -e MAX_JOBS="$(nproc --ignore=2)" \ 2025-12-04T09:05:08.6979304Z  -e SCCACHE_BUCKET \ 2025-12-04T09:05:08.6979500Z  -e SCCACHE_REGION \ 2025-12-04T09:05:08.6979695Z  -e XLA_CUDA \ 2025-12-04T09:05:08.6979891Z  -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ 2025-12-04T09:05:08.6980144Z  -e PYTORCH_TEST_CUDA_MEM_LEAK_CHECK \ 2025-12-04T09:05:08.6980404Z  -e PYTORCH_TEST_RERUN_DISABLED_TESTS \ 2025-12-04T09:05:08.6980668Z  -e SKIP_SCCACHE_INITIALIZATION=1 \ 2025-12-04T09:05:08.6980904Z  -e HUGGING_FACE_HUB_TOKEN \ 2025-12-04T09:05:08.6981137Z  -e VLLM_TEST_HUGGING_FACE_TOKEN \ 2025-12-04T09:05:08.6981375Z  -e SCRIBE_GRAPHQL_ACCESS_TOKEN \ 2025-12-04T09:05:08.6981592Z  -e DASHBOARD_TAG \ 2025-12-04T09:05:08.6981790Z  -e ARTIFACTS_FILE_SUFFIX \ 2025-12-04T09:05:08.6982042Z  --memory="${TOTAL_AVAILABLE_MEMORY_IN_GB%.*}g" \ 2025-12-04T09:05:08.6982339Z  --memory-swap="${TOTAL_MEMORY_WITH_SWAP}g" \ 2025-12-04T09:05:08.6982635Z  --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ 2025-12-04T09:05:08.6982922Z  --security-opt seccomp=unconfined \ 2025-12-04T09:05:08.6983166Z  --cap-add=SYS_PTRACE \ 2025-12-04T09:05:08.6983365Z  --ipc=host \ 2025-12-04T09:05:08.6983548Z  ${SHM_OPTS} \ 2025-12-04T09:05:08.6983730Z  --tty \ 2025-12-04T09:05:08.6983901Z  --detach \ 2025-12-04T09:05:08.6984098Z  --name="${container_name}" \ 2025-12-04T09:05:08.6984324Z  ${JENKINS_USER} \ 2025-12-04T09:05:08.6984560Z  -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ 2025-12-04T09:05:08.6984859Z  -w /var/lib/jenkins/workspace \ 2025-12-04T09:05:08.6985081Z  "${USED_IMAGE}" \ 2025-12-04T09:05:08.6985275Z  ${DOCKER_SHELL_CMD} 2025-12-04T09:05:08.6985462Z ) 2025-12-04T09:05:08.6985693Z echo "DOCKER_CONTAINER_ID=${container_name}" >> "${GITHUB_ENV}" 2025-12-04T09:05:08.6985979Z  2025-12-04T09:05:08.6986158Z if [[ ${BUILD_ENVIRONMENT} == *"s390x"* ]]; then 2025-12-04T09:05:08.6986565Z  docker exec -t "${container_name}" sh -c "python3 -m pip install -r .ci/docker/requirements-ci.txt" 2025-12-04T09:05:08.6986927Z fi 2025-12-04T09:05:08.6987064Z  2025-12-04T09:05:08.6987405Z docker exec -t "${container_name}" sh -c "python3 -m pip install $(echo dist/*.whl)[opt-einsum] && ${TEST_COMMAND}" 2025-12-04T09:05:08.6994173Z shell: /usr/bin/bash -e {0} 2025-12-04T09:05:08.6994368Z env: 2025-12-04T09:05:08.6994525Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:05:08.6994714Z HAS_NVIDIA_GPU: true 2025-12-04T09:05:08.6994942Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:05:08.6995239Z BUILD_ENVIRONMENT: linux-jammy-cuda12.8-py3.10-gcc11 2025-12-04T09:05:08.6995486Z PR_NUMBER: 2025-12-04T09:05:08.6995659Z GITHUB_REPOSITORY: pytorch/pytorch 2025-12-04T09:05:08.6995867Z GITHUB_WORKFLOW: trunk 2025-12-04T09:05:08.6996044Z GITHUB_JOB: test 2025-12-04T09:05:08.6996219Z GITHUB_RUN_ID: 19922768520 2025-12-04T09:05:08.6996405Z GITHUB_RUN_NUMBER: 158165 2025-12-04T09:05:08.6996592Z GITHUB_RUN_ATTEMPT: 1 2025-12-04T09:05:08.6996764Z JOB_ID: 57116084853 2025-12-04T09:05:08.6997196Z JOB_NAME: linux-jammy-cuda12.8-py3.10-gcc11 / test (default, 1, 5, lf.linux.g6.4xlarge.experimental.nvidia.gpu, rerun_disabled_tests) 2025-12-04T09:05:08.6997846Z BRANCH: main 2025-12-04T09:05:08.6998054Z SHA1: ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T09:05:08.6998334Z BASE_SHA: ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T09:05:08.6998568Z TEST_CONFIG: default 2025-12-04T09:05:08.6998738Z SHARD_NUMBER: 1 2025-12-04T09:05:08.6998903Z NUM_TEST_SHARDS: 5 2025-12-04T09:05:08.6999058Z EXTRA_FLAGS: 2025-12-04T09:05:08.6999220Z OP_BENCHMARK_TESTS: 2025-12-04T09:05:08.6999392Z REENABLED_ISSUES: 2025-12-04T09:05:08.6999566Z CONTINUE_THROUGH_ERROR: True 2025-12-04T09:05:08.6999758Z VERBOSE_TEST_LOGS: False 2025-12-04T09:05:08.6999943Z TEST_SHOWLOCALS: False 2025-12-04T09:05:08.7000126Z NO_TEST_TIMEOUT: False 2025-12-04T09:05:08.7000287Z NO_TD: False 2025-12-04T09:05:08.7000446Z TD_DISTRIBUTED: False 2025-12-04T09:05:08.7000668Z SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2 2025-12-04T09:05:08.7000917Z SCCACHE_REGION: us-east-1 2025-12-04T09:05:08.7001103Z SHM_SIZE: 2g 2025-12-04T09:05:08.7001664Z DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T09:05:08.7002668Z DOCKER_IMAGE_S390X: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T09:05:08.7003277Z XLA_CUDA: 2025-12-04T09:05:08.7003537Z XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla 2025-12-04T09:05:08.7003865Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK: 0 2025-12-04T09:05:08.7004084Z PYTORCH_TEST_RERUN_DISABLED_TESTS: 1 2025-12-04T09:05:08.7004295Z DASHBOARD_TAG: 2025-12-04T09:05:08.7004619Z VLLM_TEST_HUGGING_FACE_TOKEN: *** 2025-12-04T09:05:08.7004908Z HUGGING_FACE_HUB_TOKEN: *** 2025-12-04T09:05:08.7005193Z SCRIBE_GRAPHQL_ACCESS_TOKEN: *** 2025-12-04T09:05:08.7005577Z ARTIFACTS_FILE_SUFFIX: test-default-1-5-lf.linux.g6.4xlarge.experimental.nvidia.gpu_57116084853 2025-12-04T09:05:08.7005972Z ##[endgroup] 2025-12-04T09:05:08.7029972Z + [[ default == \m\u\l\t\i\g\p\u ]] 2025-12-04T09:05:08.7030308Z + [[ linux-jammy-cuda12.8-py3.10-gcc11 == *onnx* ]] 2025-12-04T09:05:08.7030646Z + TEST_COMMAND=.ci/pytorch/test.sh 2025-12-04T09:05:08.7033327Z ++ awk '/MemTotal/ { printf "%.3f \n", $2/1024/1024 - 1 }' /proc/meminfo 2025-12-04T09:05:08.7054927Z + TOTAL_AVAILABLE_MEMORY_IN_GB='59.453 ' 2025-12-04T09:05:08.7055502Z + TOTAL_MEMORY_WITH_SWAP=62 2025-12-04T09:05:08.7056096Z + [[ linux-jammy-cuda12.8-py3.10-gcc11 == *\s\3\9\0\x* ]] 2025-12-04T09:05:08.7056739Z + SHM_OPTS=--shm-size=2g 2025-12-04T09:05:08.7057120Z + JENKINS_USER='--user jenkins' 2025-12-04T09:05:08.7057434Z + DOCKER_SHELL_CMD= 2025-12-04T09:05:08.7058405Z + USED_IMAGE=308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T09:05:08.7064335Z +++ nproc --ignore=2 2025-12-04T09:05:08.7099166Z ++ docker run --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all -e BUILD_ENVIRONMENT -e PR_NUMBER -e GITHUB_ACTIONS -e GITHUB_REPOSITORY -e GITHUB_WORKFLOW -e GITHUB_JOB -e GITHUB_RUN_ID -e GITHUB_RUN_NUMBER -e GITHUB_RUN_ATTEMPT -e JOB_ID -e JOB_NAME -e BASE_SHA -e BRANCH -e SHA1 -e AWS_DEFAULT_REGION -e IN_WHEEL_TEST -e SHARD_NUMBER -e TEST_CONFIG -e NUM_TEST_SHARDS -e REENABLED_ISSUES -e CONTINUE_THROUGH_ERROR -e VERBOSE_TEST_LOGS -e TEST_SHOWLOCALS -e NO_TEST_TIMEOUT -e NO_TD -e TD_DISTRIBUTED -e PR_LABELS -e MAX_JOBS=14 -e SCCACHE_BUCKET -e SCCACHE_REGION -e XLA_CUDA -e XLA_CLANG_CACHE_S3_BUCKET_NAME -e PYTORCH_TEST_CUDA_MEM_LEAK_CHECK -e PYTORCH_TEST_RERUN_DISABLED_TESTS -e SKIP_SCCACHE_INITIALIZATION=1 -e HUGGING_FACE_HUB_TOKEN -e VLLM_TEST_HUGGING_FACE_TOKEN -e SCRIBE_GRAPHQL_ACCESS_TOKEN -e DASHBOARD_TAG -e ARTIFACTS_FILE_SUFFIX --memory=59g --memory-swap=62g --env-file=/tmp/github_env_19922768520 --security-opt seccomp=unconfined --cap-add=SYS_PTRACE --ipc=host --shm-size=2g --tty --detach --name= --user jenkins -v /home/ec2-user/actions-runner/_work/pytorch/pytorch:/var/lib/jenkins/workspace -w /var/lib/jenkins/workspace 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T09:05:17.8244669Z + container_name=c2ae7847e2aafd5d52f3ce4ca7b43f721da17c1dd22c8b80be2b080cb30eafd7 2025-12-04T09:05:17.8245567Z + echo DOCKER_CONTAINER_ID=c2ae7847e2aafd5d52f3ce4ca7b43f721da17c1dd22c8b80be2b080cb30eafd7 2025-12-04T09:05:17.8246469Z + [[ linux-jammy-cuda12.8-py3.10-gcc11 == *\s\3\9\0\x* ]] 2025-12-04T09:05:17.8252677Z ++ echo dist/torch-2.10.0a0+gitffd9b0f-cp310-cp310-linux_x86_64.whl 2025-12-04T09:05:17.8255195Z + docker exec -t c2ae7847e2aafd5d52f3ce4ca7b43f721da17c1dd22c8b80be2b080cb30eafd7 sh -c 'python3 -m pip install dist/torch-2.10.0a0+gitffd9b0f-cp310-cp310-linux_x86_64.whl[opt-einsum] && .ci/pytorch/test.sh' 2025-12-04T09:05:18.2606553Z Processing ./dist/torch-2.10.0a0+gitffd9b0f-cp310-cp310-linux_x86_64.whl (from torch==2.10.0a0+gitffd9b0f) 2025-12-04T09:05:18.6184072Z Requirement already satisfied: filelock in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch==2.10.0a0+gitffd9b0f->torch==2.10.0a0+gitffd9b0f) (3.18.0) 2025-12-04T09:05:18.6187909Z Requirement already satisfied: typing-extensions>=4.10.0 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch==2.10.0a0+gitffd9b0f->torch==2.10.0a0+gitffd9b0f) (4.12.2) 2025-12-04T09:05:18.6192179Z Requirement already satisfied: sympy>=1.13.3 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch==2.10.0a0+gitffd9b0f->torch==2.10.0a0+gitffd9b0f) (1.13.3) 2025-12-04T09:05:18.6196176Z Requirement already satisfied: networkx>=2.5.1 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch==2.10.0a0+gitffd9b0f->torch==2.10.0a0+gitffd9b0f) (2.8.8) 2025-12-04T09:05:18.6199401Z Requirement already satisfied: jinja2 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch==2.10.0a0+gitffd9b0f->torch==2.10.0a0+gitffd9b0f) (3.1.6) 2025-12-04T09:05:18.6203505Z Requirement already satisfied: fsspec>=0.8.5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch==2.10.0a0+gitffd9b0f->torch==2.10.0a0+gitffd9b0f) (2025.10.0) 2025-12-04T09:05:18.6216000Z Requirement already satisfied: opt-einsum>=3.3 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch==2.10.0a0+gitffd9b0f->torch==2.10.0a0+gitffd9b0f) (3.3.0) 2025-12-04T09:05:18.6533969Z Requirement already satisfied: numpy>=1.7 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from opt-einsum>=3.3->torch==2.10.0a0+gitffd9b0f->torch==2.10.0a0+gitffd9b0f) (1.22.4) 2025-12-04T09:05:18.6550865Z Requirement already satisfied: mpmath<1.4,>=1.1.0 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from sympy>=1.13.3->torch==2.10.0a0+gitffd9b0f->torch==2.10.0a0+gitffd9b0f) (1.3.0) 2025-12-04T09:05:18.6600924Z Requirement already satisfied: MarkupSafe>=2.0 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from jinja2->torch==2.10.0a0+gitffd9b0f->torch==2.10.0a0+gitffd9b0f) (3.0.3) 2025-12-04T09:05:19.0085812Z Installing collected packages: torch 2025-12-04T09:05:30.0105391Z Successfully installed torch-2.10.0a0+gitffd9b0f 2025-12-04T09:05:30.0628790Z + export TERM=vt100 2025-12-04T09:05:30.0629111Z + TERM=vt100 2025-12-04T09:05:30.0630948Z ++ dirname .ci/pytorch/test.sh 2025-12-04T09:05:30.0640559Z + source .ci/pytorch/common.sh 2025-12-04T09:05:30.0643908Z +++ dirname .ci/pytorch/common.sh 2025-12-04T09:05:30.0653006Z ++ source .ci/pytorch/common_utils.sh 2025-12-04T09:05:30.0654329Z +++ declare -f -t trap_add 2025-12-04T09:05:30.0659238Z ++ set -ex -o pipefail 2025-12-04T09:05:30.0659521Z ++ [[ linux-jammy-cuda12.8-py3.10-gcc11 == *rocm* ]] 2025-12-04T09:05:30.0659838Z ++ BUILD_TEST_LIBTORCH=0 2025-12-04T09:05:30.0663507Z ++ dirname .ci/pytorch/test.sh 2025-12-04T09:05:30.0670632Z + source .ci/pytorch/common-build.sh 2025-12-04T09:05:30.0671920Z ++ [[ linux-jammy-cuda12.8-py3.10-gcc11 != *win-* ]] 2025-12-04T09:05:30.0678138Z ++++ dirname .ci/pytorch/common-build.sh 2025-12-04T09:05:30.0686348Z +++ cd .ci/pytorch 2025-12-04T09:05:30.0686687Z +++ pwd -P 2025-12-04T09:05:30.0688798Z ++ script_dir=/var/lib/jenkins/workspace/.ci/pytorch 2025-12-04T09:05:30.0689247Z ++ [[ linux-jammy-cuda12.8-py3.10-gcc11 == *-pch* ]] 2025-12-04T09:05:30.0689561Z ++ which sccache 2025-12-04T09:05:30.0703647Z ++ [[ -z ossci-compiler-cache-circleci-v2 ]] 2025-12-04T09:05:30.0704080Z ++ sccache --stop-server 2025-12-04T09:05:30.0732932Z ++ true 2025-12-04T09:05:30.0733183Z ++ rm -f /var/lib/jenkins/sccache_error.log 2025-12-04T09:05:30.0744498Z ++ trap_add sccache_epilogue EXIT 2025-12-04T09:05:30.0744908Z ++ trap_add_cmd=sccache_epilogue 2025-12-04T09:05:30.0745241Z ++ shift 2025-12-04T09:05:30.0745472Z ++ for trap_add_name in "$@" 2025-12-04T09:05:30.0751105Z ++++ trap -p EXIT 2025-12-04T09:05:30.0754152Z +++ eval 'extract_trap_cmd ' 2025-12-04T09:05:30.0754458Z ++++ extract_trap_cmd 2025-12-04T09:05:30.0754763Z ++++ printf '%s\n' '' 2025-12-04T09:05:30.0755007Z +++ printf '%s\n' sccache_epilogue 2025-12-04T09:05:30.0757329Z ++ trap -- ' 2025-12-04T09:05:30.0757550Z sccache_epilogue' EXIT 2025-12-04T09:05:30.0757851Z ++ [[ -n 1 ]] 2025-12-04T09:05:30.0758290Z ++ echo 'Skipping sccache server initialization, setting environment variables' 2025-12-04T09:05:30.0758932Z Skipping sccache server initialization, setting environment variables 2025-12-04T09:05:30.0759359Z ++ export SCCACHE_IDLE_TIMEOUT=0 2025-12-04T09:05:30.0759621Z ++ SCCACHE_IDLE_TIMEOUT=0 2025-12-04T09:05:30.0759944Z ++ export SCCACHE_ERROR_LOG=/var/lib/jenkins/sccache_error.log 2025-12-04T09:05:30.0760367Z ++ SCCACHE_ERROR_LOG=/var/lib/jenkins/sccache_error.log 2025-12-04T09:05:30.0765537Z ++ export RUST_LOG=sccache::server=error 2025-12-04T09:05:30.0765881Z ++ RUST_LOG=sccache::server=error 2025-12-04T09:05:30.0766177Z ++ sccache --zero-stats 2025-12-04T09:05:30.1890078Z Statistics zeroed. 2025-12-04T09:05:30.1894798Z ++ which ccache 2025-12-04T09:05:30.1906948Z + [[ linux-jammy-cuda12.8-py3.10-gcc11 != *rocm* ]] 2025-12-04T09:05:30.1907327Z + [[ linux-jammy-cuda12.8-py3.10-gcc11 != *s390x* ]] 2025-12-04T09:05:30.1907628Z + [[ -d /var/lib/jenkins/workspace ]] 2025-12-04T09:05:30.1909105Z ++ stat -c %u /var/lib/jenkins/workspace 2025-12-04T09:05:30.1925851Z + WORKSPACE_ORIGINAL_OWNER_ID=1000 2025-12-04T09:05:30.1926231Z + trap_add cleanup_workspace EXIT 2025-12-04T09:05:30.1926614Z + trap_add_cmd=cleanup_workspace 2025-12-04T09:05:30.1926868Z + shift 2025-12-04T09:05:30.1927071Z + for trap_add_name in "$@" 2025-12-04T09:05:30.1933044Z +++ trap -p EXIT 2025-12-04T09:05:30.1936194Z ++ eval 'extract_trap_cmd trap -- '\'' 2025-12-04T09:05:30.1936595Z sccache_epilogue'\'' EXIT' 2025-12-04T09:05:30.1936872Z +++ extract_trap_cmd trap -- ' 2025-12-04T09:05:30.1937118Z sccache_epilogue' EXIT 2025-12-04T09:05:30.1937332Z +++ printf '%s\n' ' 2025-12-04T09:05:30.1937540Z sccache_epilogue' 2025-12-04T09:05:30.1937765Z ++ printf '%s\n' cleanup_workspace 2025-12-04T09:05:30.1938689Z + trap -- ' 2025-12-04T09:05:30.1938910Z sccache_epilogue 2025-12-04T09:05:30.1939133Z cleanup_workspace' EXIT 2025-12-04T09:05:30.1939539Z + sudo chown -R jenkins /var/lib/jenkins/workspace 2025-12-04T09:05:31.1157422Z + git config --global --add safe.directory /var/lib/jenkins/workspace 2025-12-04T09:05:31.1179364Z + [[ linux-jammy-cuda12.8-py3.10-gcc11 == *cuda* ]] 2025-12-04T09:05:31.1182846Z ++ python -c 'import os;import numba.cuda; print(os.path.dirname(numba.cuda.__file__))' 2025-12-04T09:05:31.4780083Z + NUMBA_CUDA_DIR=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/numba/cuda 2025-12-04T09:05:31.4780775Z + '[' -n /opt/conda/envs/py_3.10/lib/python3.10/site-packages/numba/cuda ']' 2025-12-04T09:05:31.4785413Z +++ realpath .ci/pytorch/test.sh 2025-12-04T09:05:31.4796543Z ++ dirname /var/lib/jenkins/workspace/.ci/pytorch/test.sh 2025-12-04T09:05:31.4806481Z + NUMBA_PATCH=/var/lib/jenkins/workspace/.ci/pytorch/numba-cuda-13.patch 2025-12-04T09:05:31.4807124Z + pushd /opt/conda/envs/py_3.10/lib/python3.10/site-packages/numba/cuda 2025-12-04T09:05:31.4808202Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/numba/cuda ~/workspace 2025-12-04T09:05:31.4808603Z + patch -p4 2025-12-04T09:05:31.4822778Z patching file cudadrv/driver.py 2025-12-04T09:05:31.4823146Z Hunk #1 succeeded at 357 (offset -8 lines). 2025-12-04T09:05:31.4891504Z + popd 2025-12-04T09:05:31.4891705Z ~/workspace 2025-12-04T09:05:31.4891924Z + echo 'Environment variables:' 2025-12-04T09:05:31.4892197Z Environment variables: 2025-12-04T09:05:31.4892498Z + env 2025-12-04T09:05:31.4902670Z GITHUB_WORKSPACE=/home/ec2-user/actions-runner/_work/pytorch/pytorch 2025-12-04T09:05:31.4903140Z CONTINUE_THROUGH_ERROR=True 2025-12-04T09:05:31.4903459Z BUILD_ENVIRONMENT=linux-jammy-cuda12.8-py3.10-gcc11 2025-12-04T09:05:31.4904058Z VLLM_TEST_HUGGING_FACE_TOKEN=*** 2025-12-04T09:05:31.4904406Z HOSTNAME=c2ae7847e2aa 2025-12-04T09:05:31.4904991Z GITHUB_PATH=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_1b6fc04e-e3cc-4e82-b722-032fce93309e 2025-12-04T09:05:31.4905578Z GITHUB_ACTION=__run_3 2025-12-04T09:05:31.4905817Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=0 2025-12-04T09:05:31.4906095Z GITHUB_RUN_NUMBER=158165 2025-12-04T09:05:31.4906328Z TEST_CONFIG=default 2025-12-04T09:05:31.4906550Z GITHUB_REPOSITORY_OWNER_ID=21003710 2025-12-04T09:05:31.4906847Z TORCH_NVCC_FLAGS=-Xfatbin -compress-all 2025-12-04T09:05:31.4907149Z SCCACHE_IDLE_TIMEOUT=0 2025-12-04T09:05:31.4907519Z SCRIBE_GRAPHQL_ACCESS_TOKEN=*** 2025-12-04T09:05:31.4907788Z GITHUB_TRIGGERING_ACTOR=huydhn 2025-12-04T09:05:31.4908058Z GITHUB_REF_TYPE=branch 2025-12-04T09:05:31.4908336Z BASE_SHA=ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T09:05:31.4908641Z XLA_CUDA= 2025-12-04T09:05:31.4908849Z NCCL_LIB_DIR=/usr/local/cuda/lib64/ 2025-12-04T09:05:31.4909242Z HUGGING_FACE_HUB_TOKEN=*** 2025-12-04T09:05:31.4909749Z *** 2025-12-04T09:05:31.4909957Z GITHUB_REPOSITORY_ID=65600975 2025-12-04T09:05:31.4910222Z GITHUB_ACTIONS=true 2025-12-04T09:05:31.4910458Z NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:05:31.4910766Z SCCACHE_ERROR_LOG=/var/lib/jenkins/sccache_error.log 2025-12-04T09:05:31.4911139Z SHA1=ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T09:05:31.4911499Z GITHUB_SHA=ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T09:05:31.4911951Z GITHUB_WORKFLOW_REF=pytorch/pytorch/.github/workflows/trunk.yml@refs/heads/main 2025-12-04T09:05:31.4912317Z UCC_HOME=/usr 2025-12-04T09:05:31.4912491Z VERBOSE_TEST_LOGS=False 2025-12-04T09:05:31.4912683Z GITHUB_REF=refs/heads/main 2025-12-04T09:05:31.4912888Z SHARD_NUMBER=1 2025-12-04T09:05:31.4913070Z GITHUB_REF_PROTECTED=true 2025-12-04T09:05:31.4913273Z HOME=/var/lib/jenkins 2025-12-04T09:05:31.4913495Z GITHUB_API_URL=https://api.github.com 2025-12-04T09:05:31.4913745Z PYTORCH_TEST_RERUN_DISABLED_TESTS=1 2025-12-04T09:05:31.4914007Z UCX_COMMIT=7836b165abdbe468a2f607e7254011c07d788152 2025-12-04T09:05:31.4914267Z USE_SYSTEM_NCCL=1 2025-12-04T09:05:31.4914431Z NUM_TEST_SHARDS=5 2025-12-04T09:05:31.4914596Z UCX_HOME=/usr 2025-12-04T09:05:31.4915058Z GITHUB_STATE=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/save_state_1b6fc04e-e3cc-4e82-b722-032fce93309e 2025-12-04T09:05:31.4915857Z JOB_NAME=linux-jammy-cuda12.8-py3.10-gcc11 / test (default, 1, 5, lf.linux.g6.4xlarge.experimental.nvidia.gpu, rerun_disabled_tests) 2025-12-04T09:05:31.4916605Z GITHUB_ENV=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_env_1b6fc04e-e3cc-4e82-b722-032fce93309e 2025-12-04T09:05:31.4917218Z GITHUB_EVENT_PATH=/home/ec2-user/actions-runner/_work/_temp/_github_workflow/event.json 2025-12-04T09:05:31.4917604Z GITHUB_EVENT_NAME=schedule 2025-12-04T09:05:31.4917796Z DASHBOARD_TAG= 2025-12-04T09:05:31.4917958Z GITHUB_RUN_ID=19922768520 2025-12-04T09:05:31.4918150Z INSTALLED_OPENBLAS= 2025-12-04T09:05:31.4918617Z GITHUB_STEP_SUMMARY=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/step_summary_1b6fc04e-e3cc-4e82-b722-032fce93309e 2025-12-04T09:05:31.4919122Z GITHUB_ACTOR=huydhn 2025-12-04T09:05:31.4919293Z PR_NUMBER= 2025-12-04T09:05:31.4919673Z DESIRED_CUDA=12.8.1 2025-12-04T09:05:31.4919843Z GITHUB_RUN_ATTEMPT=1 2025-12-04T09:05:31.4920165Z ANACONDA_PYTHON_VERSION=3.10 2025-12-04T09:05:31.4920423Z GITHUB_GRAPHQL_URL=https://api.github.com/graphql 2025-12-04T09:05:31.4920686Z TERM=vt100 2025-12-04T09:05:31.4920841Z INSTALLED_VISION=yes 2025-12-04T09:05:31.4921019Z BRANCH=main 2025-12-04T09:05:31.4921185Z SCCACHE_REGION=us-east-1 2025-12-04T09:05:31.4921391Z OPENSSL_ROOT_DIR=/opt/openssl 2025-12-04T09:05:31.4921613Z BUILD_AOT_INDUCTOR_TEST= 2025-12-04T09:05:31.4921811Z CUDA_PATH=/usr/local/cuda 2025-12-04T09:05:31.4922244Z GITHUB_ACTION_PATH=/home/ec2-user/actions-runner/_work/pytorch/pytorch/./.github/actions/setup-linux 2025-12-04T09:05:31.4922759Z GITHUB_SERVER_URL=https://github.com 2025-12-04T09:05:31.4923014Z UCC_COMMIT=430e241bf5d38cbc73fc7a6b89155397232e3f96 2025-12-04T09:05:31.4923246Z REENABLED_ISSUES= 2025-12-04T09:05:31.4923403Z DOCS= 2025-12-04T09:05:31.4923549Z SHLVL=1 2025-12-04T09:05:31.4923683Z MAX_JOBS=14 2025-12-04T09:05:31.4923842Z GITHUB_ACTOR_ID=475357 2025-12-04T09:05:31.4924088Z GITHUB_WORKFLOW_SHA=ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T09:05:31.4924359Z GITHUB_REF_NAME=main 2025-12-04T09:05:31.4924631Z XLA_CLANG_CACHE_S3_BUCKET_NAME=ossci-compiler-clang-cache-circleci-xla 2025-12-04T09:05:31.4924933Z GITHUB_JOB=test 2025-12-04T09:05:31.4925096Z NO_TEST_TIMEOUT=False 2025-12-04T09:05:31.4925273Z TD_DISTRIBUTED=False 2025-12-04T09:05:31.4925466Z GITHUB_REPOSITORY=pytorch/pytorch 2025-12-04T09:05:31.4925677Z GITHUB_RETENTION_DAYS=90 2025-12-04T09:05:31.4925858Z OPENSSL_DIR=/opt/openssl 2025-12-04T09:05:31.4926048Z GITHUB_ACTION_REPOSITORY= 2025-12-04T09:05:31.4926593Z PATH=/opt/cache/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/opt/conda/envs/py_3.10/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2025-12-04T09:05:31.4927139Z GITHUB_BASE_REF= 2025-12-04T09:05:31.4927301Z INSTALLED_ACL= 2025-12-04T09:05:31.4927647Z ARTIFACTS_FILE_SUFFIX=test-default-1-5-lf.linux.g6.4xlarge.experimental.nvidia.gpu_57116084853 2025-12-04T09:05:31.4928029Z CI=true 2025-12-04T09:05:31.4928191Z GITHUB_REPOSITORY_OWNER=pytorch 2025-12-04T09:05:31.4928433Z RUST_LOG=sccache::server=error 2025-12-04T09:05:31.4928622Z JOB_ID=57116084853 2025-12-04T09:05:31.4928776Z GITHUB_HEAD_REF= 2025-12-04T09:05:31.4928931Z GITHUB_ACTION_REF= 2025-12-04T09:05:31.4929133Z SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2 2025-12-04T09:05:31.4929377Z TEST_SHOWLOCALS=False 2025-12-04T09:05:31.4929548Z GITHUB_WORKFLOW=trunk 2025-12-04T09:05:31.4929725Z DEBIAN_FRONTEND=noninteractive 2025-12-04T09:05:31.4930156Z GITHUB_OUTPUT=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_output_1b6fc04e-e3cc-4e82-b722-032fce93309e 2025-12-04T09:05:31.4930608Z NO_TD=False 2025-12-04T09:05:31.4930776Z SKIP_SCCACHE_INITIALIZATION=1 2025-12-04T09:05:31.4930987Z NCCL_INCLUDE_DIR=/usr/local/cuda/include/ 2025-12-04T09:05:31.4931220Z _=/usr/bin/env 2025-12-04T09:05:31.4931477Z OLDPWD=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/numba/cuda 2025-12-04T09:05:31.4931848Z ++ python -c 'import site; print(site.getsitepackages()[0])' 2025-12-04T09:05:31.5029514Z + TORCH_INSTALL_DIR=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch 2025-12-04T09:05:31.5030182Z + TORCH_BIN_DIR=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/bin 2025-12-04T09:05:31.5030834Z + TORCH_LIB_DIR=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib 2025-12-04T09:05:31.5031565Z + TORCH_TEST_DIR=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/test 2025-12-04T09:05:31.5031961Z + BUILD_DIR=build 2025-12-04T09:05:31.5032189Z + BUILD_RENAMED_DIR=build_renamed 2025-12-04T09:05:31.5032460Z + BUILD_BIN_DIR=build/bin 2025-12-04T09:05:31.5032693Z + SHARD_NUMBER=1 2025-12-04T09:05:31.5032896Z + NUM_TEST_SHARDS=5 2025-12-04T09:05:31.5033125Z + export TORCH_SERIALIZATION_DEBUG=1 2025-12-04T09:05:31.5033394Z + TORCH_SERIALIZATION_DEBUG=1 2025-12-04T09:05:31.5033643Z + export VALGRIND=ON 2025-12-04T09:05:31.5033905Z + VALGRIND=ON 2025-12-04T09:05:31.5034577Z + [[ linux-jammy-cuda12.8-py3.10-gcc11 == *clang9* ]] 2025-12-04T09:05:31.5035123Z + [[ linux-jammy-cuda12.8-py3.10-gcc11 == *xpu* ]] 2025-12-04T09:05:31.5035444Z + detect_cuda_arch 2025-12-04T09:05:31.5035690Z + [[ linux-jammy-cuda12.8-py3.10-gcc11 == *cuda* ]] 2025-12-04T09:05:31.5035997Z + command -v nvidia-smi 2025-12-04T09:05:31.5036242Z /usr/bin/nvidia-smi 2025-12-04T09:05:31.5040648Z ++ nvidia-smi --query-gpu=compute_cap --format=csv 2025-12-04T09:05:31.5042275Z ++ tail -n 1 2025-12-04T09:05:31.5247781Z + TORCH_CUDA_ARCH_LIST=8.9 2025-12-04T09:05:31.5248501Z + export TORCH_CUDA_ARCH_LIST 2025-12-04T09:05:31.5249348Z + [[ linux-jammy-cuda12.8-py3.10-gcc11 == *s390x* ]] 2025-12-04T09:05:31.5250219Z + [[ 1 == \1 ]] 2025-12-04T09:05:31.5250601Z + ulimit -c 0 2025-12-04T09:05:31.5251070Z + [[ linux-jammy-cuda12.8-py3.10-gcc11 != *bazel* ]] 2025-12-04T09:05:31.5255078Z ++ realpath build/custom_test_artifacts 2025-12-04T09:05:31.5263499Z + CUSTOM_TEST_ARTIFACT_BUILD_DIR=/var/lib/jenkins/workspace/build/custom_test_artifacts 2025-12-04T09:05:31.5263992Z + [[ -n '' ]] 2025-12-04T09:05:31.5264214Z + echo 'Environment variables' 2025-12-04T09:05:31.5264478Z Environment variables 2025-12-04T09:05:31.5264694Z + env 2025-12-04T09:05:31.5271161Z GITHUB_WORKSPACE=/home/ec2-user/actions-runner/_work/pytorch/pytorch 2025-12-04T09:05:31.5271932Z CONTINUE_THROUGH_ERROR=True 2025-12-04T09:05:31.5272347Z BUILD_ENVIRONMENT=linux-jammy-cuda12.8-py3.10-gcc11 2025-12-04T09:05:31.5272908Z VLLM_TEST_HUGGING_FACE_TOKEN=*** 2025-12-04T09:05:31.5273173Z HOSTNAME=c2ae7847e2aa 2025-12-04T09:05:31.5273757Z GITHUB_PATH=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_1b6fc04e-e3cc-4e82-b722-032fce93309e 2025-12-04T09:05:31.5274354Z GITHUB_ACTION=__run_3 2025-12-04T09:05:31.5274595Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=0 2025-12-04T09:05:31.5274873Z GITHUB_RUN_NUMBER=158165 2025-12-04T09:05:31.5275102Z TEST_CONFIG=default 2025-12-04T09:05:31.5275335Z GITHUB_REPOSITORY_OWNER_ID=21003710 2025-12-04T09:05:31.5275654Z TORCH_NVCC_FLAGS=-Xfatbin -compress-all 2025-12-04T09:05:31.5275940Z SCCACHE_IDLE_TIMEOUT=0 2025-12-04T09:05:31.5276299Z SCRIBE_GRAPHQL_ACCESS_TOKEN=*** 2025-12-04T09:05:31.5276573Z GITHUB_TRIGGERING_ACTOR=huydhn 2025-12-04T09:05:31.5276826Z GITHUB_REF_TYPE=branch 2025-12-04T09:05:31.5277057Z TORCH_CUDA_ARCH_LIST=8.9 2025-12-04T09:05:31.5277344Z BASE_SHA=ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T09:05:31.5277661Z XLA_CUDA= 2025-12-04T09:05:31.5277875Z NCCL_LIB_DIR=/usr/local/cuda/lib64/ 2025-12-04T09:05:31.5278339Z HUGGING_FACE_HUB_TOKEN=*** 2025-12-04T09:05:31.5278629Z *** 2025-12-04T09:05:31.5278819Z GITHUB_REPOSITORY_ID=65600975 2025-12-04T09:05:31.5279080Z GITHUB_ACTIONS=true 2025-12-04T09:05:31.5279322Z NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:05:31.5279630Z SCCACHE_ERROR_LOG=/var/lib/jenkins/sccache_error.log 2025-12-04T09:05:31.5280128Z SHA1=ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T09:05:31.5280492Z GITHUB_SHA=ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T09:05:31.5280986Z GITHUB_WORKFLOW_REF=pytorch/pytorch/.github/workflows/trunk.yml@refs/heads/main 2025-12-04T09:05:31.5281414Z UCC_HOME=/usr 2025-12-04T09:05:31.5281690Z TORCH_SERIALIZATION_DEBUG=1 2025-12-04T09:05:31.5281980Z VERBOSE_TEST_LOGS=False 2025-12-04T09:05:31.5282309Z GITHUB_REF=refs/heads/main 2025-12-04T09:05:31.5282573Z SHARD_NUMBER=1 2025-12-04T09:05:31.5282753Z GITHUB_REF_PROTECTED=true 2025-12-04T09:05:31.5282941Z HOME=/var/lib/jenkins 2025-12-04T09:05:31.5283144Z GITHUB_API_URL=https://api.github.com 2025-12-04T09:05:31.5283393Z PYTORCH_TEST_RERUN_DISABLED_TESTS=1 2025-12-04T09:05:31.5283641Z UCX_COMMIT=7836b165abdbe468a2f607e7254011c07d788152 2025-12-04T09:05:31.5283886Z USE_SYSTEM_NCCL=1 2025-12-04T09:05:31.5284051Z NUM_TEST_SHARDS=5 2025-12-04T09:05:31.5284202Z UCX_HOME=/usr 2025-12-04T09:05:31.5284618Z GITHUB_STATE=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/save_state_1b6fc04e-e3cc-4e82-b722-032fce93309e 2025-12-04T09:05:31.5285553Z JOB_NAME=linux-jammy-cuda12.8-py3.10-gcc11 / test (default, 1, 5, lf.linux.g6.4xlarge.experimental.nvidia.gpu, rerun_disabled_tests) 2025-12-04T09:05:31.5286414Z GITHUB_ENV=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_env_1b6fc04e-e3cc-4e82-b722-032fce93309e 2025-12-04T09:05:31.5286988Z GITHUB_EVENT_PATH=/home/ec2-user/actions-runner/_work/_temp/_github_workflow/event.json 2025-12-04T09:05:31.5287345Z GITHUB_EVENT_NAME=schedule 2025-12-04T09:05:31.5287533Z DASHBOARD_TAG= 2025-12-04T09:05:31.5287692Z GITHUB_RUN_ID=19922768520 2025-12-04T09:05:31.5287878Z INSTALLED_OPENBLAS= 2025-12-04T09:05:31.5288314Z GITHUB_STEP_SUMMARY=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/step_summary_1b6fc04e-e3cc-4e82-b722-032fce93309e 2025-12-04T09:05:31.5288790Z GITHUB_ACTOR=huydhn 2025-12-04T09:05:31.5288944Z PR_NUMBER= 2025-12-04T09:05:31.5289094Z DESIRED_CUDA=12.8.1 2025-12-04T09:05:31.5289261Z GITHUB_RUN_ATTEMPT=1 2025-12-04T09:05:31.5289423Z VALGRIND=ON 2025-12-04T09:05:31.5289602Z ANACONDA_PYTHON_VERSION=3.10 2025-12-04T09:05:31.5289851Z GITHUB_GRAPHQL_URL=https://api.github.com/graphql 2025-12-04T09:05:31.5290099Z TERM=vt100 2025-12-04T09:05:31.5290245Z INSTALLED_VISION=yes 2025-12-04T09:05:31.5290414Z BRANCH=main 2025-12-04T09:05:31.5290571Z SCCACHE_REGION=us-east-1 2025-12-04T09:05:31.5290758Z OPENSSL_ROOT_DIR=/opt/openssl 2025-12-04T09:05:31.5290962Z BUILD_AOT_INDUCTOR_TEST= 2025-12-04T09:05:31.5291153Z CUDA_PATH=/usr/local/cuda 2025-12-04T09:05:31.5291526Z GITHUB_ACTION_PATH=/home/ec2-user/actions-runner/_work/pytorch/pytorch/./.github/actions/setup-linux 2025-12-04T09:05:31.5291942Z GITHUB_SERVER_URL=https://github.com 2025-12-04T09:05:31.5292199Z UCC_COMMIT=430e241bf5d38cbc73fc7a6b89155397232e3f96 2025-12-04T09:05:31.5292576Z REENABLED_ISSUES= 2025-12-04T09:05:31.5292731Z DOCS= 2025-12-04T09:05:31.5292871Z SHLVL=1 2025-12-04T09:05:31.5293003Z MAX_JOBS=14 2025-12-04T09:05:31.5293157Z GITHUB_ACTOR_ID=475357 2025-12-04T09:05:31.5293400Z GITHUB_WORKFLOW_SHA=ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T09:05:31.5293681Z GITHUB_REF_NAME=main 2025-12-04T09:05:31.5293946Z XLA_CLANG_CACHE_S3_BUCKET_NAME=ossci-compiler-clang-cache-circleci-xla 2025-12-04T09:05:31.5294248Z GITHUB_JOB=test 2025-12-04T09:05:31.5294410Z NO_TEST_TIMEOUT=False 2025-12-04T09:05:31.5294575Z TD_DISTRIBUTED=False 2025-12-04T09:05:31.5294766Z GITHUB_REPOSITORY=pytorch/pytorch 2025-12-04T09:05:31.5294978Z GITHUB_RETENTION_DAYS=90 2025-12-04T09:05:31.5295160Z OPENSSL_DIR=/opt/openssl 2025-12-04T09:05:31.5295346Z GITHUB_ACTION_REPOSITORY= 2025-12-04T09:05:31.5295885Z PATH=/opt/cache/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/opt/conda/envs/py_3.10/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2025-12-04T09:05:31.5296425Z GITHUB_BASE_REF= 2025-12-04T09:05:31.5296580Z INSTALLED_ACL= 2025-12-04T09:05:31.5296920Z ARTIFACTS_FILE_SUFFIX=test-default-1-5-lf.linux.g6.4xlarge.experimental.nvidia.gpu_57116084853 2025-12-04T09:05:31.5297301Z CI=true 2025-12-04T09:05:31.5297465Z GITHUB_REPOSITORY_OWNER=pytorch 2025-12-04T09:05:31.5297711Z RUST_LOG=sccache::server=error 2025-12-04T09:05:31.5297907Z JOB_ID=57116084853 2025-12-04T09:05:31.5298057Z GITHUB_HEAD_REF= 2025-12-04T09:05:31.5298216Z GITHUB_ACTION_REF= 2025-12-04T09:05:31.5298421Z SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2 2025-12-04T09:05:31.5298662Z TEST_SHOWLOCALS=False 2025-12-04T09:05:31.5298833Z GITHUB_WORKFLOW=trunk 2025-12-04T09:05:31.5299010Z DEBIAN_FRONTEND=noninteractive 2025-12-04T09:05:31.5299452Z GITHUB_OUTPUT=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_output_1b6fc04e-e3cc-4e82-b722-032fce93309e 2025-12-04T09:05:31.5299892Z NO_TD=False 2025-12-04T09:05:31.5300055Z SKIP_SCCACHE_INITIALIZATION=1 2025-12-04T09:05:31.5300270Z NCCL_INCLUDE_DIR=/usr/local/cuda/include/ 2025-12-04T09:05:31.5300583Z OLDPWD=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/numba/cuda 2025-12-04T09:05:31.5300877Z _=/usr/bin/env 2025-12-04T09:05:31.5301037Z + echo 'Testing pytorch' 2025-12-04T09:05:31.5301209Z Testing pytorch 2025-12-04T09:05:31.5301378Z + export LANG=C.UTF-8 2025-12-04T09:05:31.5301648Z + LANG=C.UTF-8 2025-12-04T09:05:31.5301876Z + PR_NUMBER= 2025-12-04T09:05:31.5302040Z + [[ default == \d\e\f\a\u\l\t ]] 2025-12-04T09:05:31.5302249Z + export CUDA_VISIBLE_DEVICES=0 2025-12-04T09:05:31.5302450Z + CUDA_VISIBLE_DEVICES=0 2025-12-04T09:05:31.5302646Z + export HIP_VISIBLE_DEVICES=0 2025-12-04T09:05:31.5302843Z + HIP_VISIBLE_DEVICES=0 2025-12-04T09:05:31.5303028Z + [[ default == \d\i\s\t\r\i\b\u\t\e\d ]] 2025-12-04T09:05:31.5303239Z + [[ default == \s\l\o\w ]] 2025-12-04T09:05:31.5303486Z + [[ linux-jammy-cuda12.8-py3.10-gcc11 == *slow-gradcheck* ]] 2025-12-04T09:05:31.5303790Z + [[ linux-jammy-cuda12.8-py3.10-gcc11 == *cuda* ]] 2025-12-04T09:05:31.5304054Z + export PYTORCH_TESTING_DEVICE_ONLY_FOR=cuda 2025-12-04T09:05:31.5304300Z + PYTORCH_TESTING_DEVICE_ONLY_FOR=cuda 2025-12-04T09:05:31.5304521Z + [[ default == *crossref* ]] 2025-12-04T09:05:31.5304739Z + [[ linux-jammy-cuda12.8-py3.10-gcc11 == *rocm* ]] 2025-12-04T09:05:31.5305012Z + [[ linux-jammy-cuda12.8-py3.10-gcc11 == *xpu* ]] 2025-12-04T09:05:31.5305301Z + [[ linux-jammy-cuda12.8-py3.10-gcc11 != *-bazel-* ]] 2025-12-04T09:05:31.5305547Z + pip_install ninja==1.10.2 2025-12-04T09:05:31.5305800Z + pip_install_pkg='python3 -m pip install --progress-bar off' 2025-12-04T09:05:31.5306151Z + python3 -m pip install --progress-bar off ninja==1.10.2 2025-12-04T09:05:32.3794164Z Collecting ninja==1.10.2 2025-12-04T09:05:32.4016476Z Downloading ninja-1.10.2-py2.py3-none-manylinux_2_5_x86_64.manylinux1_x86_64.whl.metadata (5.0 kB) 2025-12-04T09:05:32.4376935Z Downloading ninja-1.10.2-py2.py3-none-manylinux_2_5_x86_64.manylinux1_x86_64.whl (108 kB) 2025-12-04T09:05:32.7825421Z Installing collected packages: ninja 2025-12-04T09:05:32.7825849Z Attempting uninstall: ninja 2025-12-04T09:05:32.7832089Z Found existing installation: ninja 1.11.1.4 2025-12-04T09:05:32.7853740Z Uninstalling ninja-1.11.1.4: 2025-12-04T09:05:32.8017330Z Successfully uninstalled ninja-1.11.1.4 2025-12-04T09:05:32.8757278Z Successfully installed ninja-1.10.2 2025-12-04T09:05:32.9218898Z + export PATH=/var/lib/jenkins/.local/bin:/opt/cache/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/opt/conda/envs/py_3.10/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2025-12-04T09:05:32.9220686Z + PATH=/var/lib/jenkins/.local/bin:/opt/cache/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/opt/conda/envs/py_3.10/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2025-12-04T09:05:32.9221770Z + [[ linux-jammy-cuda12.8-py3.10-gcc11 == *aarch64* ]] 2025-12-04T09:05:32.9222415Z + [[ linux-jammy-cuda12.8-py3.10-gcc11 == *asan* ]] 2025-12-04T09:05:32.9222808Z + [[ linux-jammy-cuda12.8-py3.10-gcc11 == *-debug* ]] 2025-12-04T09:05:32.9223283Z + [[ linux-jammy-cuda12.8-py3.10-gcc11 != *-bazel-* ]] 2025-12-04T09:05:32.9223835Z + echo 'We are not in debug mode: linux-jammy-cuda12.8-py3.10-gcc11. Expect the assertion to pass' 2025-12-04T09:05:32.9224503Z We are not in debug mode: linux-jammy-cuda12.8-py3.10-gcc11. Expect the assertion to pass 2025-12-04T09:05:32.9224984Z + cd test 2025-12-04T09:05:32.9225310Z + python -c 'import torch; torch._C._crash_if_debug_asserts_fail(424242)' 2025-12-04T09:05:34.3779267Z + [[ default == \n\o\g\p\u\_\N\O\_\A\V\X\2 ]] 2025-12-04T09:05:34.3779692Z + [[ default == \n\o\g\p\u\_\A\V\X\5\1\2 ]] 2025-12-04T09:05:34.3780069Z + [[ default == \l\e\g\a\c\y\_\n\v\i\d\i\a\_\d\r\i\v\e\r ]] 2025-12-04T09:05:34.3784394Z + DYNAMO_BENCHMARK_FLAGS=() 2025-12-04T09:05:34.3784751Z + [[ default == *pr_time_benchmarks* ]] 2025-12-04T09:05:34.3785095Z + [[ default == *dynamo_eager* ]] 2025-12-04T09:05:34.3785382Z + [[ default == *aot_eager* ]] 2025-12-04T09:05:34.3785657Z + [[ default == *aot_inductor* ]] 2025-12-04T09:05:34.3785950Z + [[ default == *max_autotune_inductor* ]] 2025-12-04T09:05:34.3786252Z + [[ default == *inductor* ]] 2025-12-04T09:05:34.3786556Z + [[ default == *dynamic* ]] 2025-12-04T09:05:34.3786813Z + [[ default == *cpu* ]] 2025-12-04T09:05:34.3787065Z + [[ default == *xpu* ]] 2025-12-04T09:05:34.3787747Z + DYNAMO_BENCHMARK_FLAGS+=(--device cuda) 2025-12-04T09:05:34.3804978Z + [[ linux-jammy-cuda12.8-py3.10-gcc11 == *libtorch* ]] 2025-12-04T09:05:34.3805461Z + [[ linux-jammy-cuda12.8-py3.10-gcc11 == *-bazel-* ]] 2025-12-04T09:05:34.3808052Z + cd test 2025-12-04T09:05:34.3809087Z + python -c 'import torch; print(torch.__config__.show())' 2025-12-04T09:05:35.8066495Z PyTorch built with: 2025-12-04T09:05:35.8066856Z - GCC 11.4 2025-12-04T09:05:35.8067120Z - C++ Version: 201703 2025-12-04T09:05:35.8067788Z - Intel(R) oneAPI Math Kernel Library Version 2024.2-Product Build 20240605 for Intel(R) 64 architecture applications 2025-12-04T09:05:35.8068665Z - Intel(R) MKL-DNN v3.7.1 (Git Hash 8d263e693366ef8db40acc569cc7d8edf644556d) 2025-12-04T09:05:35.8069183Z - OpenMP 201511 (a.k.a. OpenMP 4.5) 2025-12-04T09:05:35.8069576Z - LAPACK is enabled (usually provided by MKL) 2025-12-04T09:05:35.8069957Z - NNPACK is enabled 2025-12-04T09:05:35.8070250Z - CPU capability usage: AVX2 2025-12-04T09:05:35.8070606Z - CUDA Runtime 12.8 2025-12-04T09:05:35.8071186Z - NVCC architecture flags: -gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_89,code=sm_89 2025-12-04T09:05:35.8071826Z - CuDNN 91.0.2 (built against CUDA 12.9) 2025-12-04T09:05:35.8076701Z - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, COMMIT_SHA=35b7a9a26c5923d98aebaa41a031dae21788a9ee, CUDA_VERSION=12.8, CUDNN_VERSION=9.10.2, CXX_COMPILER=/opt/cache/bin/c++, CXX_FLAGS= -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DLIBKINETO_NOXPUPTI=ON -DUSE_FBGEMM -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -DC10_NODEPRECATED -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-unknown-pragmas -Wno-unused-parameter -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=old-style-cast -faligned-new -Werror -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, FORCE_FALLBACK_CUDA_MPI=1, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, TORCH_VERSION=2.10.0, USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=ON, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=ON, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF, USE_XCCL=OFF, USE_XPU=OFF, 2025-12-04T09:05:35.8081486Z 2025-12-04T09:05:36.0889330Z + cd test 2025-12-04T09:05:36.0889735Z + python -c 'import torch; print(torch.__config__.parallel_info())' 2025-12-04T09:05:37.2663650Z ATen/Parallel: 2025-12-04T09:05:37.2663984Z at::get_num_threads() : 8 2025-12-04T09:05:37.2664280Z at::get_num_interop_threads() : 8 2025-12-04T09:05:37.2664577Z OpenMP 201511 (a.k.a. OpenMP 4.5) 2025-12-04T09:05:37.2664846Z omp_get_max_threads() : 8 2025-12-04T09:05:37.2665379Z Intel(R) oneAPI Math Kernel Library Version 2024.2-Product Build 20240605 for Intel(R) 64 architecture applications 2025-12-04T09:05:37.2665975Z mkl_get_max_threads() : 8 2025-12-04T09:05:37.2666326Z Intel(R) MKL-DNN v3.7.1 (Git Hash 8d263e693366ef8db40acc569cc7d8edf644556d) 2025-12-04T09:05:37.2666738Z std::thread::hardware_concurrency() : 16 2025-12-04T09:05:37.2667017Z Environment variables: 2025-12-04T09:05:37.2667263Z OMP_NUM_THREADS : [not set] 2025-12-04T09:05:37.2667503Z MKL_NUM_THREADS : [not set] 2025-12-04T09:05:37.2667751Z ATen parallel backend: OpenMP 2025-12-04T09:05:37.2667924Z 2025-12-04T09:05:37.5035434Z + [[ default == *numpy_2* ]] 2025-12-04T09:05:37.5036095Z + [[ linux-jammy-cuda12.8-py3.10-gcc11 == *aarch64* ]] 2025-12-04T09:05:37.5036881Z + [[ default == *backward* ]] 2025-12-04T09:05:37.5037326Z + [[ default == *libtorch_agnostic_targetting* ]] 2025-12-04T09:05:37.5037767Z + [[ default == *xla* ]] 2025-12-04T09:05:37.5038097Z + [[ default == *vllm* ]] 2025-12-04T09:05:37.5038436Z + [[ default == *executorch* ]] 2025-12-04T09:05:37.5038793Z + [[ default == \j\i\t\_\l\e\g\a\c\y ]] 2025-12-04T09:05:37.5039826Z + [[ default == \q\u\a\n\t\i\z\a\t\i\o\n ]] 2025-12-04T09:05:37.5040313Z + [[ linux-jammy-cuda12.8-py3.10-gcc11 == *libtorch* ]] 2025-12-04T09:05:37.5040773Z + [[ default == distributed ]] 2025-12-04T09:05:37.5041137Z + [[ default == *operator_benchmark* ]] 2025-12-04T09:05:37.5041553Z + [[ default == *operator_microbenchmark* ]] 2025-12-04T09:05:37.5041975Z + [[ default == *attention_microbenchmark* ]] 2025-12-04T09:05:37.5042392Z + [[ default == *inductor_distributed* ]] 2025-12-04T09:05:37.5042793Z + [[ default == *inductor-halide* ]] 2025-12-04T09:05:37.5043177Z + [[ default == *inductor-pallas* ]] 2025-12-04T09:05:37.5043552Z + [[ default == *inductor-triton-cpu* ]] 2025-12-04T09:05:37.5043911Z + [[ default == *inductor-micro-benchmark* ]] 2025-12-04T09:05:37.5044280Z + [[ default == *aoti_cross_compile_for_windows* ]] 2025-12-04T09:05:37.5044623Z + [[ default == *huggingface* ]] 2025-12-04T09:05:37.5044903Z + [[ default == *timm* ]] 2025-12-04T09:05:37.5045169Z + [[ default == cachebench ]] 2025-12-04T09:05:37.5045474Z + [[ default == verify_cachebench ]] 2025-12-04T09:05:37.5045777Z + [[ default == *torchbench* ]] 2025-12-04T09:05:37.5046080Z + [[ default == *inductor_cpp_wrapper* ]] 2025-12-04T09:05:37.5046403Z + [[ default == *inductor_core* ]] 2025-12-04T09:05:37.5046690Z + [[ default == *inductor* ]] 2025-12-04T09:05:37.5046964Z + [[ default == *einops* ]] 2025-12-04T09:05:37.5047235Z + [[ default == *dynamo_core* ]] 2025-12-04T09:05:37.5047516Z + [[ default == *dynamo_wrapped* ]] 2025-12-04T09:05:37.5047861Z + [[ linux-jammy-cuda12.8-py3.10-gcc11 == *rocm* ]] 2025-12-04T09:05:37.5048196Z + [[ 1 == 1 ]] 2025-12-04T09:05:37.5048407Z + [[ 5 -gt 1 ]] 2025-12-04T09:05:37.5048661Z + test_lazy_tensor_meta_reference_disabled 2025-12-04T09:05:37.5049067Z + export TORCH_DISABLE_FUNCTIONALIZATION_META_REFERENCE=1 2025-12-04T09:05:37.5049507Z + TORCH_DISABLE_FUNCTIONALIZATION_META_REFERENCE=1 2025-12-04T09:05:37.5049952Z + echo 'Testing lazy tensor operations without meta reference' 2025-12-04T09:05:37.5050424Z Testing lazy tensor operations without meta reference 2025-12-04T09:05:37.5050917Z + python test/run_test.py --include lazy/test_ts_opinfo.py --verbose 2025-12-04T09:05:41.7883611Z Downloading https://ossci-metrics.s3.amazonaws.com/disabled-tests-condensed.json to /var/lib/jenkins/workspace/test/.pytorch-disabled-tests.json 2025-12-04T09:05:41.8335766Z Ignoring disabled issues: [''] 2025-12-04T09:05:41.8413625Z Found test times from artifacts 2025-12-04T09:05:41.8730510Z Found test times from artifacts 2025-12-04T09:05:41.8740295Z Running all tests 2025-12-04T09:05:41.8743604Z Running parallel tests on 3 processes 2025-12-04T09:05:41.8743993Z Name: tests to run (est. time: 0.01min) 2025-12-04T09:05:41.8744314Z Serial tests (0): 2025-12-04T09:05:41.8744574Z Parallel tests (1): 2025-12-04T09:05:41.8744849Z lazy/test_ts_opinfo 1/1 2025-12-04T09:05:41.8745102Z Name: excluded (est. time: 0.0min) 2025-12-04T09:05:41.8745343Z Serial tests (0): 2025-12-04T09:05:41.8745566Z Parallel tests (0): 2025-12-04T09:05:41.8745909Z Running lazy/test_ts_opinfo 1/1 ... [2025-12-04 09:05:41.874384][785.538203619] 2025-12-04T09:05:41.8746295Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:05:41.8749699Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'lazy/test_ts_opinfo.py', '-m', 'serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:05:41.874718] 2025-12-04T09:05:45.8437253Z 2025-12-04T09:05:45.8438398Z lazy/test_ts_opinfo 1/1 was successful, full logs can be found in artifacts with path test/test-reports/lazy.test_ts_opinfo_1.1_784637acdd02e0b4_.log 2025-12-04T09:05:45.8439311Z Running 0 items in this shard: 2025-12-04T09:05:45.8439529Z 2025-12-04T09:05:45.8439844Z Finished lazy/test_ts_opinfo 1/1 ... [2025-12-04 09:05:45.843652][789.507471179], took 0.07min 2025-12-04T09:05:45.8445451Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/lazy.test_ts_opinfo/lazy.test_ts_opinfo-94fdd26a21107d16.xml 2025-12-04T09:05:46.2420260Z Uploading artifacts took 0.12 seconds 2025-12-04T09:05:49.1592879Z Running lazy/test_ts_opinfo 1/1 ... [2025-12-04 09:05:49.158782][792.822598313] 2025-12-04T09:05:49.1593559Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:05:49.1595279Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'lazy/test_ts_opinfo.py', '-m', 'not serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:05:49.159198] 2025-12-04T09:05:53.3021236Z 2025-12-04T09:05:53.3022136Z lazy/test_ts_opinfo 1/1 was successful, full logs can be found in artifacts with path test/test-reports/lazy.test_ts_opinfo_1.1_7a32b03ee4eba52c_.log 2025-12-04T09:05:53.3022932Z Running 0 items in this shard: 2025-12-04T09:05:53.3023157Z 2025-12-04T09:05:53.3023458Z Finished lazy/test_ts_opinfo 1/1 ... [2025-12-04 09:05:53.302025][796.965840518], took 0.07min 2025-12-04T09:05:53.3035158Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/lazy.test_ts_opinfo/lazy.test_ts_opinfo-4a097beb92ace568.xml 2025-12-04T09:05:54.1933740Z Running test batch 'tests to run' cost 12.32 seconds 2025-12-04T09:05:54.8056363Z 2025-12-04T09:05:54.8056784Z real 0m17.302s 2025-12-04T09:05:54.8057053Z user 0m23.898s 2025-12-04T09:05:54.8057264Z sys 0m9.436s 2025-12-04T09:05:54.8057581Z + export -n TORCH_DISABLE_FUNCTIONALIZATION_META_REFERENCE 2025-12-04T09:05:54.8057964Z + test_without_numpy 2025-12-04T09:05:54.8061063Z ++ dirname .ci/pytorch/test.sh 2025-12-04T09:05:54.8073599Z + pushd .ci/pytorch 2025-12-04T09:05:54.8073891Z ~/workspace/.ci/pytorch ~/workspace 2025-12-04T09:05:54.8074748Z + python -c 'import sys;sys.path.insert(0, '\''fake_numpy'\'');from unittest import TestCase;import torch;x=torch.randn(3,3);TestCase().assertRaises(RuntimeError, lambda: x.numpy())' 2025-12-04T09:05:55.5033375Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_subclasses/functional_tensor.py:283: UserWarning: Failed to initialize NumPy: Sorry PyTorch, but our NumPy is in the other folder (Triggered internally at /var/lib/jenkins/workspace/torch/csrc/utils/tensor_numpy.cpp:84.) 2025-12-04T09:05:55.5035233Z cpu = _conversion_method_template(device=torch.device("cpu")) 2025-12-04T09:05:56.1599742Z + python -c 'import sys;sys.path.insert(0, '\''fake_numpy'\'');import torch;print(torch.tensor([torch.tensor(0.), torch.tensor(1.)]))' 2025-12-04T09:05:56.8596251Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_subclasses/functional_tensor.py:283: UserWarning: Failed to initialize NumPy: Sorry PyTorch, but our NumPy is in the other folder (Triggered internally at /var/lib/jenkins/workspace/torch/csrc/utils/tensor_numpy.cpp:84.) 2025-12-04T09:05:56.8597983Z cpu = _conversion_method_template(device=torch.device("cpu")) 2025-12-04T09:05:57.2707469Z tensor([0., 1.]) 2025-12-04T09:05:57.4946643Z + [[ default == *dynamo_wrapped* ]] 2025-12-04T09:05:57.4947310Z + python -c 'import sys;sys.path.insert(0, '\''fake_numpy'\'');import torch; import torch.onnx' 2025-12-04T09:05:58.1937588Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_subclasses/functional_tensor.py:283: UserWarning: Failed to initialize NumPy: Sorry PyTorch, but our NumPy is in the other folder (Triggered internally at /var/lib/jenkins/workspace/torch/csrc/utils/tensor_numpy.cpp:84.) 2025-12-04T09:05:58.1939269Z cpu = _conversion_method_template(device=torch.device("cpu")) 2025-12-04T09:05:58.8334656Z + popd 2025-12-04T09:05:58.8334922Z ~/workspace 2025-12-04T09:05:58.8335134Z + install_torchvision 2025-12-04T09:05:58.8335373Z + local orig_preload 2025-12-04T09:05:58.8335581Z + local commit 2025-12-04T09:05:58.8338776Z ++ get_pinned_commit vision 2025-12-04T09:05:58.8339085Z ++ cat .github/ci_commit_pins/vision.txt 2025-12-04T09:05:58.8354424Z + commit=617079d944b0e72632311c30ae2bbdf1168b901e 2025-12-04T09:05:58.8354936Z + orig_preload= 2025-12-04T09:05:58.8355222Z + '[' -n '' ']' 2025-12-04T09:05:58.8355495Z + [[ linux-jammy-cuda12.8-py3.10-gcc11 == *cuda* ]] 2025-12-04T09:05:58.8355818Z + export FORCE_CUDA=1 2025-12-04T09:05:58.8356034Z + FORCE_CUDA=1 2025-12-04T09:05:58.8356233Z + export WITH_CUDA=1 2025-12-04T09:05:58.8356448Z + WITH_CUDA=1 2025-12-04T09:05:58.8356980Z + pip_build_and_install git+https://github.com/pytorch/vision.git@617079d944b0e72632311c30ae2bbdf1168b901e dist/vision 2025-12-04T09:05:58.8357762Z + local build_target=git+https://github.com/pytorch/vision.git@617079d944b0e72632311c30ae2bbdf1168b901e 2025-12-04T09:05:58.8358258Z + local wheel_dir=dist/vision 2025-12-04T09:05:58.8358493Z + local found_whl=0 2025-12-04T09:05:58.8358704Z + for file in "${wheel_dir}"/*.whl 2025-12-04T09:05:58.8358971Z + [[ -f dist/vision/*.whl ]] 2025-12-04T09:05:58.8359174Z + '[' 0 == 0 ']' 2025-12-04T09:05:58.8359664Z + python3 -m pip wheel --no-build-isolation --no-deps -w dist/vision git+https://github.com/pytorch/vision.git@617079d944b0e72632311c30ae2bbdf1168b901e 2025-12-04T09:05:59.1192780Z Collecting git+https://github.com/pytorch/vision.git@617079d944b0e72632311c30ae2bbdf1168b901e 2025-12-04T09:05:59.1196646Z Cloning https://github.com/pytorch/vision.git (to revision 617079d944b0e72632311c30ae2bbdf1168b901e) to /tmp/pip-req-build-3hh1vpem 2025-12-04T09:05:59.1220895Z Running command git clone --filter=blob:none --quiet https://github.com/pytorch/vision.git /tmp/pip-req-build-3hh1vpem 2025-12-04T09:06:00.5627461Z Running command git rev-parse -q --verify 'sha^617079d944b0e72632311c30ae2bbdf1168b901e' 2025-12-04T09:06:00.5652612Z Running command git fetch -q https://github.com/pytorch/vision.git 617079d944b0e72632311c30ae2bbdf1168b901e 2025-12-04T09:06:00.6635613Z Resolved https://github.com/pytorch/vision.git to commit 617079d944b0e72632311c30ae2bbdf1168b901e 2025-12-04T09:06:02.4738179Z Preparing metadata (pyproject.toml) ... [?25l- \ | done 2025-12-04T09:06:02.4779078Z [?25hBuilding wheels for collected packages: torchvision 2025-12-04T09:07:17.5226663Z Building wheel for torchvision (pyproject.toml) ... [?25l- \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | done 2025-12-04T09:07:17.5255763Z [?25h Created wheel for torchvision: filename=torchvision-0.25.0a0+617079d-cp310-cp310-linux_x86_64.whl size=1786625 sha256=bbec47562a97db4c611e70e9b3a7ec2a4ff1afcd6a14edf82cf35a46d682d987 2025-12-04T09:07:17.5257403Z Stored in directory: /var/lib/jenkins/.cache/pip/wheels/12/b2/29/1f82685c5b5173629e1f36a9b93989ce92ce563e5fb91d27ac 2025-12-04T09:07:17.5290042Z Successfully built torchvision 2025-12-04T09:07:17.6244979Z + for file in "${wheel_dir}"/*.whl 2025-12-04T09:07:17.6245581Z + pip_install_whl dist/vision/torchvision-0.25.0a0+617079d-cp310-cp310-linux_x86_64.whl 2025-12-04T09:07:17.6246319Z + args=('dist/vision/torchvision-0.25.0a0+617079d-cp310-cp310-linux_x86_64.whl') 2025-12-04T09:07:17.6246804Z + local args 2025-12-04T09:07:17.6247228Z + [[ dist/vision/torchvision-0.25.0a0+617079d-cp310-cp310-linux_x86_64.whl == *\ * ]] 2025-12-04T09:07:17.6247733Z + for path in "${args[@]}" 2025-12-04T09:07:17.6248212Z + echo 'Installing dist/vision/torchvision-0.25.0a0+617079d-cp310-cp310-linux_x86_64.whl' 2025-12-04T09:07:17.6248920Z Installing dist/vision/torchvision-0.25.0a0+617079d-cp310-cp310-linux_x86_64.whl 2025-12-04T09:07:17.6249723Z + python3 -mpip install --no-index --no-deps dist/vision/torchvision-0.25.0a0+617079d-cp310-cp310-linux_x86_64.whl 2025-12-04T09:07:17.9191789Z Processing ./dist/vision/torchvision-0.25.0a0+617079d-cp310-cp310-linux_x86_64.whl 2025-12-04T09:07:17.9274214Z Installing collected packages: torchvision 2025-12-04T09:07:18.3390298Z Successfully installed torchvision-0.25.0a0+617079d 2025-12-04T09:07:18.3682872Z + '[' -n '' ']' 2025-12-04T09:07:18.3683991Z + test_python_shard 1 2025-12-04T09:07:18.3684414Z + [[ -z 5 ]] 2025-12-04T09:07:18.3686134Z + python test/run_test.py --exclude-jit-executor --exclude-distributed-tests --exclude-quantization-tests --shard 1 5 --verbose --upload-artifacts-while-running 2025-12-04T09:07:22.7292919Z Downloading https://ossci-metrics.s3.amazonaws.com/disabled-tests-condensed.json to /var/lib/jenkins/workspace/test/.pytorch-disabled-tests.json 2025-12-04T09:07:22.7371324Z Found test times from artifacts 2025-12-04T09:07:22.7693920Z Found test times from artifacts 2025-12-04T09:07:22.7703472Z Running all tests 2025-12-04T09:07:22.8213090Z Running parallel tests on 3 processes 2025-12-04T09:07:22.8216807Z Name: tests to run (est. time: 105.15min) 2025-12-04T09:07:22.8217254Z Serial tests (52): 2025-12-04T09:07:22.8217576Z inductor/test_flex_attention 1/6 2025-12-04T09:07:22.8217914Z inductor/test_flex_attention 2/6 2025-12-04T09:07:22.8218231Z inductor/test_flex_attention 3/6 2025-12-04T09:07:22.8218547Z inductor/test_flex_attention 4/6 2025-12-04T09:07:22.8218920Z inductor/test_flex_attention 5/6 2025-12-04T09:07:22.8219229Z inductor/test_flex_attention 6/6 2025-12-04T09:07:22.8219548Z dynamo/test_fake_distributed 1/1 2025-12-04T09:07:22.8219860Z test_ci_sanity_check_fail 1/1 2025-12-04T09:07:22.8220178Z inductor/test_max_autotune 1/1 2025-12-04T09:07:22.8220478Z doctests 1/1 2025-12-04T09:07:22.8220737Z inductor/test_cutlass_backend 1/1 2025-12-04T09:07:22.8221063Z inductor/test_benchmark_fusion 1/1 2025-12-04T09:07:22.8221406Z inductor/test_distributed_patterns 1/1 2025-12-04T09:07:22.8221746Z test_sort_and_select 1/1 2025-12-04T09:07:22.8222080Z test_privateuseone_python_backend 1/1 2025-12-04T09:07:22.8222448Z test_cpp_api_parity 1/1 2025-12-04T09:07:22.8222705Z test_extension_utils 1/1 2025-12-04T09:07:22.8222947Z test_show_pickle 1/1 2025-12-04T09:07:22.8223225Z test_torch 1/1 2025-12-04T09:07:22.8223543Z test_tensorexpr 1/1 2025-12-04T09:07:22.8223887Z test_utils 1/1 2025-12-04T09:07:22.8224148Z test_namedtuple_return_api 1/1 2025-12-04T09:07:22.8224416Z test_fake_tensor 1/1 2025-12-04T09:07:22.8224655Z test_multiprocessing 1/1 2025-12-04T09:07:22.8224892Z test_fx 1/1 2025-12-04T09:07:22.8225110Z test_autograd_fallback 1/1 2025-12-04T09:07:22.8225360Z test_autocast 1/1 2025-12-04T09:07:22.8225588Z test_python_dispatch 1/1 2025-12-04T09:07:22.8225835Z test_jit_disabled 1/1 2025-12-04T09:07:22.8226116Z test_cpp_extensions_mtia_backend 1/1 2025-12-04T09:07:22.8226542Z functorch/test_memory_efficient_fusion 1/1 2025-12-04T09:07:22.8226855Z test_tensor_creation_ops 1/1 2025-12-04T09:07:22.8227139Z test_cpp_extensions_stream_and_event 1/1 2025-12-04T09:07:22.8227429Z test_dispatch 1/1 2025-12-04T09:07:22.8227652Z nn/test_convolution 1/1 2025-12-04T09:07:22.8227907Z test_cpp_extensions_jit 1/1 2025-12-04T09:07:22.8228156Z test_nn 1/1 2025-12-04T09:07:22.8228373Z test_multiprocessing_spawn 1/1 2025-12-04T09:07:22.8228652Z nn/test_pooling 1/1 2025-12-04T09:07:22.8228886Z test_cuda_trace 1/1 2025-12-04T09:07:22.8229098Z test_native_mha 1/1 2025-12-04T09:07:22.8229336Z test_cuda_nvml_based_avail 1/1 2025-12-04T09:07:22.8229603Z test_mobile_optimizer 1/1 2025-12-04T09:07:22.8229846Z test_overrides 1/1 2025-12-04T09:07:22.8230078Z test_cuda_primary_ctx 1/1 2025-12-04T09:07:22.8230325Z test_reductions 1/1 2025-12-04T09:07:22.8230547Z test_spectral_ops 1/1 2025-12-04T09:07:22.8230813Z distributions/test_distributions 1/1 2025-12-04T09:07:22.8231107Z test_autoload_disable 1/1 2025-12-04T09:07:22.8231352Z test_autoload_enable 1/1 2025-12-04T09:07:22.8231615Z test_cpp_extensions_aot_ninja 1/1 2025-12-04T09:07:22.8231907Z test_cpp_extensions_aot_no_ninja 1/1 2025-12-04T09:07:22.8232198Z Parallel tests (15): 2025-12-04T09:07:22.8232449Z inductor/test_collective_autotuning 1/1 2025-12-04T09:07:22.8232765Z inductor/test_cpu_cpp_wrapper 1/1 2025-12-04T09:07:22.8233256Z inductor/test_autoheuristic 1/1 2025-12-04T09:07:22.8233631Z inductor/test_control_flow 5/5 2025-12-04T09:07:22.8233852Z inductor/test_mps_basic 1/1 2025-12-04T09:07:22.8234052Z test_ops_fwd_gradients 3/3 2025-12-04T09:07:22.8234234Z test_meta 1/4 2025-12-04T09:07:22.8234393Z test_meta 2/4 2025-12-04T09:07:22.8234560Z functorch/test_control_flow 2/4 2025-12-04T09:07:22.8234764Z functorch/test_control_flow 3/4 2025-12-04T09:07:22.8234955Z test_modules 1/2 2025-12-04T09:07:22.8235131Z optim/test_lrscheduler 1/1 2025-12-04T09:07:22.8235314Z test_sparse_csr 3/3 2025-12-04T09:07:22.8235494Z functorch/test_ops 4/9 2025-12-04T09:07:22.8235696Z functorch/test_ops 7/9 2025-12-04T09:07:22.8235878Z Name: excluded (est. time: 0.0min) 2025-12-04T09:07:22.8236077Z Serial tests (0): 2025-12-04T09:07:22.8236244Z Parallel tests (0): 2025-12-04T09:07:22.8236553Z Running inductor/test_flex_attention 1/6 ... [2025-12-04 09:07:22.822168][886.485988975] 2025-12-04T09:07:22.8236895Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:07:22.8237764Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_flex_attention.py', '--shard-id=1', '--num-shards=6', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:07:22.822491] 2025-12-04T09:14:57.0738683Z 2025-12-04T09:14:57.0739623Z PRINTING LOG FILE of inductor/test_flex_attention 1/6 (test/test-reports/inductor.test_flex_attention_1.6_88265ee86f308f18_.log) 2025-12-04T09:14:57.0789217Z Test results will be stored in test-reports/python-pytest/inductor.test_flex_attention/inductor.test_flex_attention-8a3fc0e970c9a98f.xml 2025-12-04T09:14:57.0790469Z ============================= test session starts ============================== 2025-12-04T09:14:57.0791313Z platform linux -- Python 3.10.14, pytest-7.3.2, pluggy-1.6.0 -- /opt/conda/envs/py_3.10/bin/python 2025-12-04T09:14:57.0792054Z cachedir: .pytest_cache 2025-12-04T09:14:57.0793008Z hypothesis profile 'pytorch_ci' -> database=None, max_examples=50, derandomize=True, suppress_health_check=[HealthCheck.too_slow] 2025-12-04T09:14:57.0793792Z rootdir: /var/lib/jenkins/workspace 2025-12-04T09:14:57.0794222Z configfile: pytest.ini 2025-12-04T09:14:57.0794827Z plugins: hypothesis-6.56.4, cpp-2.3.0, flakefinder-1.1.0, rerunfailures-14.0, subtests-0.13.1, xdist-3.3.1, xdoctest-1.3.0, typeguard-4.3.0 2025-12-04T09:14:57.0795853Z collecting ... collected 763 items 2025-12-04T09:14:57.0796403Z stepcurrent: Cannot find last run test, not skipping 2025-12-04T09:14:57.0819744Z Running 50 items in this shard: test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda 2025-12-04T09:14:57.0840763Z 2025-12-04T09:14:57.0841190Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda PASSED [10.9688s] [ 2%] 2025-12-04T09:14:57.0842142Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.6616s] [ 2%] 2025-12-04T09:14:57.0843147Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.4222s] [ 2%] 2025-12-04T09:14:57.0844286Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.2706s] [ 2%] 2025-12-04T09:14:57.0845655Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.5769s] [ 2%] 2025-12-04T09:14:57.0846843Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.5634s] [ 2%] 2025-12-04T09:14:57.0847818Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.7935s] [ 2%] 2025-12-04T09:14:57.0848676Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.5336s] [ 2%] 2025-12-04T09:14:57.0849777Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.4584s] [ 2%] 2025-12-04T09:14:57.0850896Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.8574s] [ 2%] 2025-12-04T09:14:57.0851999Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.6267s] [ 2%] 2025-12-04T09:14:57.0853225Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.9885s] [ 2%] 2025-12-04T09:14:57.0854194Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.7940s] [ 2%] 2025-12-04T09:14:57.0855334Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.6331s] [ 2%] 2025-12-04T09:14:57.0856550Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.7056s] [ 2%] 2025-12-04T09:14:57.0857664Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.6225s] [ 2%] 2025-12-04T09:14:57.0858358Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [9.0560s] [ 2%] 2025-12-04T09:14:57.0859416Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.6752s] [ 2%] 2025-12-04T09:14:57.0860519Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.6261s] [ 2%] 2025-12-04T09:14:57.0861595Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [9.3884s] [ 2%] 2025-12-04T09:14:57.0862874Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.6865s] [ 2%] 2025-12-04T09:14:57.0863942Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.9005s] [ 2%] 2025-12-04T09:14:57.0865013Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.9845s] [ 2%] 2025-12-04T09:14:57.0866252Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [9.3737s] [ 2%] 2025-12-04T09:14:57.0867383Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.9413s] [ 2%] 2025-12-04T09:14:57.0868597Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.7159s] [ 2%] 2025-12-04T09:14:57.0869787Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.7410s] [ 2%] 2025-12-04T09:14:57.0870530Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.8815s] [ 2%] 2025-12-04T09:14:57.0871463Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.7821s] [ 2%] 2025-12-04T09:14:57.0872336Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.6779s] [ 2%] 2025-12-04T09:14:57.0873168Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.7016s] [ 2%] 2025-12-04T09:14:57.0874213Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.9423s] [ 2%] 2025-12-04T09:14:57.0875617Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.9614s] [ 2%] 2025-12-04T09:14:57.0876716Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.6491s] [ 2%] 2025-12-04T09:14:57.0877671Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.9966s] [ 2%] 2025-12-04T09:14:57.0878595Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [9.0881s] [ 2%] 2025-12-04T09:14:57.0879790Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.8634s] [ 2%] 2025-12-04T09:14:57.0880852Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [9.0042s] [ 2%] 2025-12-04T09:14:57.0881958Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [9.0146s] [ 2%] 2025-12-04T09:14:57.0882998Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [9.0749s] [ 2%] 2025-12-04T09:14:57.0884112Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [9.3859s] [ 2%] 2025-12-04T09:14:57.0885085Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.9844s] [ 2%] 2025-12-04T09:14:57.0886101Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.8614s] [ 2%] 2025-12-04T09:14:57.0887052Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.9624s] [ 2%] 2025-12-04T09:14:57.0888044Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [9.1600s] [ 2%] 2025-12-04T09:14:57.0889139Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [9.0727s] [ 2%] 2025-12-04T09:14:57.0890241Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [9.1677s] [ 2%] 2025-12-04T09:14:57.0891320Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.6989s] [ 2%] 2025-12-04T09:14:57.0892459Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.8898s] [ 2%] 2025-12-04T09:14:57.0893607Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.8961s] [ 2%] 2025-12-04T09:14:57.0894177Z 2025-12-04T09:14:57.0894308Z =================================== FAILURES =================================== 2025-12-04T09:14:57.0894917Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:14:57.0895508Z Traceback (most recent call last): 2025-12-04T09:14:57.0896248Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:14:57.0897013Z self.assertTrue( 2025-12-04T09:14:57.0897493Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:14:57.0898054Z raise self.failureException(msg) 2025-12-04T09:14:57.0898638Z AssertionError: False is not true : Log file /tmp/tmpmh6p_qgg/flex_attention_configs.json was not created 2025-12-04T09:14:57.0899071Z 2025-12-04T09:14:57.0899311Z To execute this test, run the following from the base repo dir: 2025-12-04T09:14:57.0899991Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:14:57.0900569Z 2025-12-04T09:14:57.0900860Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:14:57.0901525Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.0902040Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.0902411Z unimplemented [] 2025-12-04T09:14:57.0902707Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.0905290Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('async_compile_cache_miss', 23), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:14:57.0907651Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.0908141Z graph_break [] 2025-12-04T09:14:57.0908551Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.0910644Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:14:57.0912480Z current_size = base.storage().size() 2025-12-04T09:14:57.0912797Z Autotune Choices Stats: 2025-12-04T09:14:57.0914525Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:57.0917231Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.0918252Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.0919289Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.0921588Z triton_flex_attention_4 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.0925196Z triton_flex_attention_5 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.0928455Z triton_flex_attention_0 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.0932243Z triton_flex_attention_2 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.0936458Z triton_flex_attention_3 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.0940460Z triton_flex_attention_1 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.0942954Z SingleProcess AUTOTUNE benchmarking takes 0.2600 seconds and 1.4244 seconds precompiling for 6 choices 2025-12-04T09:14:57.0943637Z Autotune Choices Stats: 2025-12-04T09:14:57.0946306Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:57.0949632Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.0951071Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.0952750Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.0955901Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.0959883Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.0963778Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.0967927Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.0972103Z triton_flex_attention_backward_11 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.0976347Z triton_flex_attention_backward_12 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.0980513Z triton_flex_attention_backward_13 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.0984631Z triton_flex_attention_backward_10 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.0988279Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.0992440Z triton_flex_attention_backward_15 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.0995196Z SingleProcess AUTOTUNE benchmarking takes 0.8717 seconds and 2.2315 seconds precompiling for 13 choices 2025-12-04T09:14:57.0995987Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.0996465Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.0996802Z unimplemented [] 2025-12-04T09:14:57.0997140Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.0997758Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.1000297Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.1002319Z graph_break [] 2025-12-04T09:14:57.1002720Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.1003199Z Autotune Choices Stats: 2025-12-04T09:14:57.1005784Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.1008723Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.1009651Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.1010705Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.1013478Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1017426Z triton_flex_attention_24 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1021186Z triton_flex_attention_19 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1024897Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.1028400Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.1032239Z triton_flex_attention_20 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1034710Z SingleProcess AUTOTUNE benchmarking takes 0.2898 seconds and 1.2778 seconds precompiling for 6 choices 2025-12-04T09:14:57.1035414Z Autotune Choices Stats: 2025-12-04T09:14:57.1038035Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.1041210Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.1042420Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.1043982Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.1047022Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1051179Z triton_flex_attention_backward_27 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.1055070Z triton_flex_attention_backward_28 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.1058931Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.1062528Z triton_flex_attention_backward_30 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1066200Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.1070157Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.1073754Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.1077952Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1081429Z triton_flex_attention_backward_37 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.1084007Z SingleProcess AUTOTUNE benchmarking takes 0.6605 seconds and 2.3199 seconds precompiling for 13 choices 2025-12-04T09:14:57.1084901Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:14:57.1085479Z Traceback (most recent call last): 2025-12-04T09:14:57.1086244Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:14:57.1086989Z self.assertTrue( 2025-12-04T09:14:57.1087514Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:14:57.1088116Z raise self.failureException(msg) 2025-12-04T09:14:57.1088783Z AssertionError: False is not true : Log file /tmp/tmp804t851x/flex_attention_configs.json was not created 2025-12-04T09:14:57.1089328Z 2025-12-04T09:14:57.1089554Z To execute this test, run the following from the base repo dir: 2025-12-04T09:14:57.1090364Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:14:57.1090930Z 2025-12-04T09:14:57.1091215Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:14:57.1091850Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.1092439Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.1092789Z unimplemented [] 2025-12-04T09:14:57.1093049Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.1094876Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('async_compile_cache_miss', 23), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:14:57.1097433Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.1098002Z graph_break [] 2025-12-04T09:14:57.1098389Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.1100480Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:14:57.1102041Z current_size = base.storage().size() 2025-12-04T09:14:57.1102272Z Autotune Choices Stats: 2025-12-04T09:14:57.1104853Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:57.1107888Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.1108830Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.1109879Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.1112527Z triton_flex_attention_4 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1116518Z triton_flex_attention_5 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1120494Z triton_flex_attention_0 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1124479Z triton_flex_attention_2 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.1128477Z triton_flex_attention_3 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.1131903Z triton_flex_attention_1 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1134498Z SingleProcess AUTOTUNE benchmarking takes 0.2600 seconds and 1.4244 seconds precompiling for 6 choices 2025-12-04T09:14:57.1135181Z Autotune Choices Stats: 2025-12-04T09:14:57.1137976Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:57.1141422Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.1142877Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.1144560Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.1146798Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1150640Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.1154466Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.1158148Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.1162502Z triton_flex_attention_backward_11 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1166815Z triton_flex_attention_backward_12 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.1171140Z triton_flex_attention_backward_13 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.1175399Z triton_flex_attention_backward_10 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.1179533Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.1183668Z triton_flex_attention_backward_15 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1186207Z SingleProcess AUTOTUNE benchmarking takes 0.8717 seconds and 2.2315 seconds precompiling for 13 choices 2025-12-04T09:14:57.1187013Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.1187491Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.1187825Z unimplemented [] 2025-12-04T09:14:57.1188170Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.1188816Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.1191370Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.1193693Z graph_break [] 2025-12-04T09:14:57.1194087Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.1194583Z Autotune Choices Stats: 2025-12-04T09:14:57.1197339Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.1200407Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.1201337Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.1202401Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.1239929Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1243764Z triton_flex_attention_24 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1247827Z triton_flex_attention_19 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1252031Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.1256475Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.1260771Z triton_flex_attention_20 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1263966Z SingleProcess AUTOTUNE benchmarking takes 0.2898 seconds and 1.2778 seconds precompiling for 6 choices 2025-12-04T09:14:57.1264490Z Autotune Choices Stats: 2025-12-04T09:14:57.1266294Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.1269451Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.1270638Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.1272365Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.1275153Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1278490Z triton_flex_attention_backward_27 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.1282318Z triton_flex_attention_backward_28 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.1286143Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.1290199Z triton_flex_attention_backward_30 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1293970Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.1297181Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.1300387Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.1303989Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1307730Z triton_flex_attention_backward_37 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.1309973Z SingleProcess AUTOTUNE benchmarking takes 0.6605 seconds and 2.3199 seconds precompiling for 13 choices 2025-12-04T09:14:57.1310836Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.1311146Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.1311471Z unimplemented [] 2025-12-04T09:14:57.1311715Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.1312147Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.1314533Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.1316822Z graph_break [] 2025-12-04T09:14:57.1317345Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.1317867Z Autotune Choices Stats: 2025-12-04T09:14:57.1320119Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.1322539Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.1323291Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.1324336Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.1326614Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1329752Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1332952Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.1336636Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.1340490Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1344094Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1346653Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2616 seconds precompiling for 6 choices 2025-12-04T09:14:57.1347309Z Autotune Choices Stats: 2025-12-04T09:14:57.1349412Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.1352256Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.1353165Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.1354312Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.1357177Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1360434Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.1363944Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.1367320Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.1370821Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1374640Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.1378195Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.1381959Z triton_flex_attention_backward_50 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.1385780Z triton_flex_attention_backward_53 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1389301Z triton_flex_attention_backward_56 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.1391589Z SingleProcess AUTOTUNE benchmarking takes 0.6631 seconds and 2.2635 seconds precompiling for 13 choices 2025-12-04T09:14:57.1392389Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:14:57.1392927Z Traceback (most recent call last): 2025-12-04T09:14:57.1393537Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:14:57.1394142Z self.assertTrue( 2025-12-04T09:14:57.1394620Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:14:57.1394973Z raise self.failureException(msg) 2025-12-04T09:14:57.1395880Z AssertionError: False is not true : Log file /tmp/tmpz5j3y52j/flex_attention_configs.json was not created 2025-12-04T09:14:57.1396484Z 2025-12-04T09:14:57.1396716Z To execute this test, run the following from the base repo dir: 2025-12-04T09:14:57.1397448Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:14:57.1397885Z 2025-12-04T09:14:57.1398136Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:14:57.1398541Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.1398900Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.1399181Z unimplemented [] 2025-12-04T09:14:57.1399392Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.1401435Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('async_compile_cache_miss', 23), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:14:57.1403331Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.1403876Z graph_break [] 2025-12-04T09:14:57.1404210Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.1405911Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:14:57.1407322Z current_size = base.storage().size() 2025-12-04T09:14:57.1407671Z Autotune Choices Stats: 2025-12-04T09:14:57.1410073Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:57.1412697Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.1413324Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.1414129Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.1416260Z triton_flex_attention_4 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1420307Z triton_flex_attention_5 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1423924Z triton_flex_attention_0 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1427860Z triton_flex_attention_2 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.1431678Z triton_flex_attention_3 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.1435084Z triton_flex_attention_1 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1437209Z SingleProcess AUTOTUNE benchmarking takes 0.2600 seconds and 1.4244 seconds precompiling for 6 choices 2025-12-04T09:14:57.1437824Z Autotune Choices Stats: 2025-12-04T09:14:57.1440365Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:57.1443538Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.1444590Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.1446152Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.1448631Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1452618Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.1456376Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.1459344Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.1463101Z triton_flex_attention_backward_11 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1467664Z triton_flex_attention_backward_12 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.1471953Z triton_flex_attention_backward_13 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.1476390Z triton_flex_attention_backward_10 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.1480861Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.1484989Z triton_flex_attention_backward_15 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1487632Z SingleProcess AUTOTUNE benchmarking takes 0.8717 seconds and 2.2315 seconds precompiling for 13 choices 2025-12-04T09:14:57.1488318Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.1489007Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.1489326Z unimplemented [] 2025-12-04T09:14:57.1489941Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.1490620Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.1493090Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.1495404Z graph_break [] 2025-12-04T09:14:57.1496068Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.1496578Z Autotune Choices Stats: 2025-12-04T09:14:57.1499047Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.1501948Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.1502748Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.1503715Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.1505706Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1509186Z triton_flex_attention_24 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1512215Z triton_flex_attention_19 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1515964Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.1520077Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.1523797Z triton_flex_attention_20 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1526352Z SingleProcess AUTOTUNE benchmarking takes 0.2898 seconds and 1.2778 seconds precompiling for 6 choices 2025-12-04T09:14:57.1527239Z Autotune Choices Stats: 2025-12-04T09:14:57.1529781Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.1533249Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.1534680Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.1536554Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.1540065Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1544275Z triton_flex_attention_backward_27 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.1548296Z triton_flex_attention_backward_28 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.1551956Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.1556078Z triton_flex_attention_backward_30 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1559487Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.1563817Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.1568037Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.1572149Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1576412Z triton_flex_attention_backward_37 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.1578833Z SingleProcess AUTOTUNE benchmarking takes 0.6605 seconds and 2.3199 seconds precompiling for 13 choices 2025-12-04T09:14:57.1579493Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.1579967Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.1580270Z unimplemented [] 2025-12-04T09:14:57.1580603Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.1581126Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.1583705Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.1586231Z graph_break [] 2025-12-04T09:14:57.1586815Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.1587433Z Autotune Choices Stats: 2025-12-04T09:14:57.1589838Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.1592346Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.1593176Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.1594304Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.1597382Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1601201Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1605323Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.1609257Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.1613265Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1617099Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1619732Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2616 seconds precompiling for 6 choices 2025-12-04T09:14:57.1620480Z Autotune Choices Stats: 2025-12-04T09:14:57.1622916Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.1626489Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.1628257Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.1630137Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.1633212Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1637449Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.1641737Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.1645951Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.1650263Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1654604Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.1659086Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.1663639Z triton_flex_attention_backward_50 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.1668017Z triton_flex_attention_backward_53 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1672320Z triton_flex_attention_backward_56 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.1675070Z SingleProcess AUTOTUNE benchmarking takes 0.6631 seconds and 2.2635 seconds precompiling for 13 choices 2025-12-04T09:14:57.1676096Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.1676729Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.1677223Z unimplemented [] 2025-12-04T09:14:57.1677692Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.1678554Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.1726622Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.1729814Z graph_break [] 2025-12-04T09:14:57.1737657Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.1738434Z Autotune Choices Stats: 2025-12-04T09:14:57.1741117Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.1744270Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.1745543Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.1746960Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.1749732Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1753859Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1758043Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.1761017Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.1764572Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1767885Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1770559Z SingleProcess AUTOTUNE benchmarking takes 0.2907 seconds and 1.2500 seconds precompiling for 6 choices 2025-12-04T09:14:57.1771403Z Autotune Choices Stats: 2025-12-04T09:14:57.1774522Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.1777577Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.1779000Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.1780440Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.1783803Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.1787765Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1790634Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.1793989Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.1798033Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.1801862Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1804685Z triton_flex_attention_backward_69 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.1807848Z triton_flex_attention_backward_70 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.1811514Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1815166Z triton_flex_attention_backward_75 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.1817343Z SingleProcess AUTOTUNE benchmarking takes 0.6601 seconds and 2.1469 seconds precompiling for 13 choices 2025-12-04T09:14:57.1818337Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:14:57.1819163Z Traceback (most recent call last): 2025-12-04T09:14:57.1820127Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:14:57.1820930Z self.assertTrue( 2025-12-04T09:14:57.1821523Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:14:57.1822239Z raise self.failureException(msg) 2025-12-04T09:14:57.1822954Z AssertionError: False is not true : Log file /tmp/tmpe9s264og/flex_attention_configs.json was not created 2025-12-04T09:14:57.1823677Z 2025-12-04T09:14:57.1824081Z To execute this test, run the following from the base repo dir: 2025-12-04T09:14:57.1825229Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:14:57.1825908Z 2025-12-04T09:14:57.1826270Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:14:57.1827167Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.1827934Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.1828355Z unimplemented [] 2025-12-04T09:14:57.1828634Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.1831455Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('async_compile_cache_miss', 23), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:14:57.1834247Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.1834989Z graph_break [] 2025-12-04T09:14:57.1835460Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.1836974Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:14:57.1838811Z current_size = base.storage().size() 2025-12-04T09:14:57.1839141Z Autotune Choices Stats: 2025-12-04T09:14:57.1841387Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:57.1844158Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.1845199Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.1846237Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.1848435Z triton_flex_attention_4 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1852436Z triton_flex_attention_5 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1855865Z triton_flex_attention_0 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1859251Z triton_flex_attention_2 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.1863802Z triton_flex_attention_3 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.1867882Z triton_flex_attention_1 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1870470Z SingleProcess AUTOTUNE benchmarking takes 0.2600 seconds and 1.4244 seconds precompiling for 6 choices 2025-12-04T09:14:57.1871450Z Autotune Choices Stats: 2025-12-04T09:14:57.1873873Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:57.1877387Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.1879144Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.1880717Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.1883895Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1887951Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.1892377Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.1896506Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.1900274Z triton_flex_attention_backward_11 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1903944Z triton_flex_attention_backward_12 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.1907606Z triton_flex_attention_backward_13 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.1911895Z triton_flex_attention_backward_10 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.1915845Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.1919000Z triton_flex_attention_backward_15 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1921476Z SingleProcess AUTOTUNE benchmarking takes 0.8717 seconds and 2.2315 seconds precompiling for 13 choices 2025-12-04T09:14:57.1922184Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.1922823Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.1923324Z unimplemented [] 2025-12-04T09:14:57.1923914Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.1924720Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.1927248Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.1929742Z graph_break [] 2025-12-04T09:14:57.1930314Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.1930912Z Autotune Choices Stats: 2025-12-04T09:14:57.1933219Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.1935441Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.1936102Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.1936819Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.1938406Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1941543Z triton_flex_attention_24 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1945167Z triton_flex_attention_19 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1948847Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.1952491Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.1955973Z triton_flex_attention_20 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1958362Z SingleProcess AUTOTUNE benchmarking takes 0.2898 seconds and 1.2778 seconds precompiling for 6 choices 2025-12-04T09:14:57.1959265Z Autotune Choices Stats: 2025-12-04T09:14:57.1962276Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.1965407Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.1967134Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.1969107Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.1972486Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1977052Z triton_flex_attention_backward_27 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.1980717Z triton_flex_attention_backward_28 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.1984154Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.1988342Z triton_flex_attention_backward_30 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.1992696Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.1996799Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.2001046Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.2005430Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2009923Z triton_flex_attention_backward_37 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.2012896Z SingleProcess AUTOTUNE benchmarking takes 0.6605 seconds and 2.3199 seconds precompiling for 13 choices 2025-12-04T09:14:57.2013749Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.2014473Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.2014942Z unimplemented [] 2025-12-04T09:14:57.2015433Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.2016179Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.2018861Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.2021308Z graph_break [] 2025-12-04T09:14:57.2022145Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.2022745Z Autotune Choices Stats: 2025-12-04T09:14:57.2025509Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.2028624Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.2029725Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.2030895Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.2033697Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2037787Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2042083Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.2046366Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.2050455Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2054671Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2057337Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2616 seconds precompiling for 6 choices 2025-12-04T09:14:57.2058219Z Autotune Choices Stats: 2025-12-04T09:14:57.2060984Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.2064637Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.2066195Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.2068009Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.2071255Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2075822Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.2080130Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.2084404Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.2088299Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2091872Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.2095456Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.2099027Z triton_flex_attention_backward_50 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.2103207Z triton_flex_attention_backward_53 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2106370Z triton_flex_attention_backward_56 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.2109010Z SingleProcess AUTOTUNE benchmarking takes 0.6631 seconds and 2.2635 seconds precompiling for 13 choices 2025-12-04T09:14:57.2110004Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.2110473Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.2110830Z unimplemented [] 2025-12-04T09:14:57.2111159Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.2111822Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.2114470Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.2116927Z graph_break [] 2025-12-04T09:14:57.2117475Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.2118209Z Autotune Choices Stats: 2025-12-04T09:14:57.2120913Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.2123689Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.2124849Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.2125962Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.2128058Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2131851Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2135977Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.2140149Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.2143136Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2146722Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2148363Z SingleProcess AUTOTUNE benchmarking takes 0.2907 seconds and 1.2500 seconds precompiling for 6 choices 2025-12-04T09:14:57.2148860Z Autotune Choices Stats: 2025-12-04T09:14:57.2150520Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.2152598Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.2153530Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.2154572Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.2156634Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.2159230Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2162274Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.2164845Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.2167318Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.2169836Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2172503Z triton_flex_attention_backward_69 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.2175101Z triton_flex_attention_backward_70 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.2177737Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2180241Z triton_flex_attention_backward_75 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.2181835Z SingleProcess AUTOTUNE benchmarking takes 0.6601 seconds and 2.1469 seconds precompiling for 13 choices 2025-12-04T09:14:57.2182414Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.2182776Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.2183105Z unimplemented [] 2025-12-04T09:14:57.2183421Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.2183858Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.2185423Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.2186856Z graph_break [] 2025-12-04T09:14:57.2187188Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.2187597Z Autotune Choices Stats: 2025-12-04T09:14:57.2189187Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.2191001Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.2191677Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.2192392Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.2194067Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2196600Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2198991Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.2201413Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.2203808Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2206173Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2207765Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2576 seconds precompiling for 6 choices 2025-12-04T09:14:57.2208239Z Autotune Choices Stats: 2025-12-04T09:14:57.2209875Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_84", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.2211937Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.2212891Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.2214105Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.2216065Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.2218550Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.2221034Z triton_flex_attention_backward_83 0.0122 ms 92.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2223492Z triton_flex_attention_backward_82 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.2225965Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2285034Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.2287609Z triton_flex_attention_backward_88 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.2291311Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.2294275Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.2297164Z triton_flex_attention_backward_91 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2298816Z SingleProcess AUTOTUNE benchmarking takes 0.6599 seconds and 2.2799 seconds precompiling for 13 choices 2025-12-04T09:14:57.2299411Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:14:57.2299840Z Traceback (most recent call last): 2025-12-04T09:14:57.2300469Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:14:57.2301065Z self.assertTrue( 2025-12-04T09:14:57.2301431Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:14:57.2301958Z raise self.failureException(msg) 2025-12-04T09:14:57.2302467Z AssertionError: False is not true : Log file /tmp/tmpdl7gvigl/flex_attention_configs.json was not created 2025-12-04T09:14:57.2302814Z 2025-12-04T09:14:57.2302981Z To execute this test, run the following from the base repo dir: 2025-12-04T09:14:57.2303663Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:14:57.2304060Z 2025-12-04T09:14:57.2304257Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:14:57.2304717Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.2305174Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.2305461Z unimplemented [] 2025-12-04T09:14:57.2305762Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.2307368Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('async_compile_cache_miss', 23), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:14:57.2308905Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.2309392Z graph_break [] 2025-12-04T09:14:57.2310219Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.2315170Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:14:57.2317507Z current_size = base.storage().size() 2025-12-04T09:14:57.2317975Z Autotune Choices Stats: 2025-12-04T09:14:57.2320653Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:57.2323566Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.2324322Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.2324925Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.2326428Z triton_flex_attention_4 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2328692Z triton_flex_attention_5 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2331047Z triton_flex_attention_0 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2333433Z triton_flex_attention_2 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.2335693Z triton_flex_attention_3 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.2338214Z triton_flex_attention_1 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2339622Z SingleProcess AUTOTUNE benchmarking takes 0.2600 seconds and 1.4244 seconds precompiling for 6 choices 2025-12-04T09:14:57.2340017Z Autotune Choices Stats: 2025-12-04T09:14:57.2341542Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:57.2343568Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.2344392Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.2345333Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.2347116Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2349458Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.2351778Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.2354182Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.2356573Z triton_flex_attention_backward_11 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2358885Z triton_flex_attention_backward_12 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.2361200Z triton_flex_attention_backward_13 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.2363977Z triton_flex_attention_backward_10 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.2366609Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.2369048Z triton_flex_attention_backward_15 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2370647Z SingleProcess AUTOTUNE benchmarking takes 0.8717 seconds and 2.2315 seconds precompiling for 13 choices 2025-12-04T09:14:57.2371215Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.2371598Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.2371920Z unimplemented [] 2025-12-04T09:14:57.2372251Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.2372759Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.2374636Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.2376068Z graph_break [] 2025-12-04T09:14:57.2376419Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.2376799Z Autotune Choices Stats: 2025-12-04T09:14:57.2378449Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.2380253Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.2380940Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.2381661Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.2383272Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2385729Z triton_flex_attention_24 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2388107Z triton_flex_attention_19 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2390526Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.2392998Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.2395456Z triton_flex_attention_20 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2397060Z SingleProcess AUTOTUNE benchmarking takes 0.2898 seconds and 1.2778 seconds precompiling for 6 choices 2025-12-04T09:14:57.2397547Z Autotune Choices Stats: 2025-12-04T09:14:57.2399144Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.2401201Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.2402123Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.2403170Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.2405143Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2407615Z triton_flex_attention_backward_27 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.2410208Z triton_flex_attention_backward_28 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.2412859Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.2415366Z triton_flex_attention_backward_30 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2417871Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.2420332Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.2422758Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.2425270Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2427762Z triton_flex_attention_backward_37 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.2429316Z SingleProcess AUTOTUNE benchmarking takes 0.6605 seconds and 2.3199 seconds precompiling for 13 choices 2025-12-04T09:14:57.2429893Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.2430435Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.2430802Z unimplemented [] 2025-12-04T09:14:57.2431100Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.2431552Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.2433151Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.2434548Z graph_break [] 2025-12-04T09:14:57.2434827Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.2435264Z Autotune Choices Stats: 2025-12-04T09:14:57.2436853Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.2438616Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.2439319Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.2440023Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.2441628Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2444138Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2446531Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.2449048Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.2451536Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2453983Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2455578Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2616 seconds precompiling for 6 choices 2025-12-04T09:14:57.2456036Z Autotune Choices Stats: 2025-12-04T09:14:57.2457667Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.2459743Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.2460674Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.2461971Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.2463976Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2466488Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.2469175Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.2471797Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.2474264Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2476822Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.2479288Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.2481777Z triton_flex_attention_backward_50 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.2484248Z triton_flex_attention_backward_53 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2486892Z triton_flex_attention_backward_56 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.2488585Z SingleProcess AUTOTUNE benchmarking takes 0.6631 seconds and 2.2635 seconds precompiling for 13 choices 2025-12-04T09:14:57.2489155Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.2489531Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.2489845Z unimplemented [] 2025-12-04T09:14:57.2490162Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.2490605Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.2492216Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.2493748Z graph_break [] 2025-12-04T09:14:57.2494024Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.2494563Z Autotune Choices Stats: 2025-12-04T09:14:57.2496168Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.2497988Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.2498648Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.2499313Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.2500958Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2503388Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2505851Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.2508368Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.2510741Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2513118Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2514777Z SingleProcess AUTOTUNE benchmarking takes 0.2907 seconds and 1.2500 seconds precompiling for 6 choices 2025-12-04T09:14:57.2515261Z Autotune Choices Stats: 2025-12-04T09:14:57.2516946Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.2518978Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.2519913Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.2521020Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.2522906Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.2525436Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2528047Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.2530591Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.2533149Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.2535632Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2538094Z triton_flex_attention_backward_69 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.2540610Z triton_flex_attention_backward_70 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.2543062Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2545664Z triton_flex_attention_backward_75 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.2547289Z SingleProcess AUTOTUNE benchmarking takes 0.6601 seconds and 2.1469 seconds precompiling for 13 choices 2025-12-04T09:14:57.2547848Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.2548183Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.2548523Z unimplemented [] 2025-12-04T09:14:57.2548900Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.2549407Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.2550916Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.2552317Z graph_break [] 2025-12-04T09:14:57.2552674Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.2553063Z Autotune Choices Stats: 2025-12-04T09:14:57.2554668Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.2556508Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.2557100Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.2557816Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.2559476Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2562206Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2564768Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.2567385Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.2569799Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2572250Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2573915Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2576 seconds precompiling for 6 choices 2025-12-04T09:14:57.2574394Z Autotune Choices Stats: 2025-12-04T09:14:57.2576124Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_84", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.2578147Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.2579072Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.2580246Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.2582221Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.2584763Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.2587303Z triton_flex_attention_backward_83 0.0122 ms 92.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2589772Z triton_flex_attention_backward_82 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.2592302Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2594763Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.2597194Z triton_flex_attention_backward_88 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.2599698Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.2602305Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.2604867Z triton_flex_attention_backward_91 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2606451Z SingleProcess AUTOTUNE benchmarking takes 0.6599 seconds and 2.2799 seconds precompiling for 13 choices 2025-12-04T09:14:57.2606983Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.2607424Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.2607740Z unimplemented [] 2025-12-04T09:14:57.2608057Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.2608510Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.2610079Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.2611473Z graph_break [] 2025-12-04T09:14:57.2611866Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.2612361Z Autotune Choices Stats: 2025-12-04T09:14:57.2613930Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:57.2615789Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.2616422Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.2617110Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.2618837Z triton_flex_attention_99 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2621300Z triton_flex_attention_100 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2623816Z triton_flex_attention_95 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2626210Z triton_flex_attention_98 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.2628616Z triton_flex_attention_97 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.2631140Z triton_flex_attention_96 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2632684Z SingleProcess AUTOTUNE benchmarking takes 0.2905 seconds and 1.2994 seconds precompiling for 6 choices 2025-12-04T09:14:57.2633152Z Autotune Choices Stats: 2025-12-04T09:14:57.2634829Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_102", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.2636857Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.2637771Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.2638887Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.2640897Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2643477Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.2645998Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.2648502Z triton_flex_attention_backward_101 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.2651017Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2653537Z triton_flex_attention_backward_107 0.0133 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.2655972Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.2658595Z triton_flex_attention_backward_108 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.2661178Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.2663954Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2665543Z SingleProcess AUTOTUNE benchmarking takes 0.6615 seconds and 2.2434 seconds precompiling for 13 choices 2025-12-04T09:14:57.2666105Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:14:57.2666575Z Traceback (most recent call last): 2025-12-04T09:14:57.2667155Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:14:57.2667641Z self.assertTrue( 2025-12-04T09:14:57.2668091Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:14:57.2668542Z raise self.failureException(msg) 2025-12-04T09:14:57.2669006Z AssertionError: False is not true : Log file /tmp/tmpmmrfilt6/flex_attention_configs.json was not created 2025-12-04T09:14:57.2669402Z 2025-12-04T09:14:57.2669579Z To execute this test, run the following from the base repo dir: 2025-12-04T09:14:57.2670139Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:14:57.2670525Z 2025-12-04T09:14:57.2670704Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:14:57.2671235Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.2671571Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.2671843Z unimplemented [] 2025-12-04T09:14:57.2672213Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.2673710Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('async_compile_cache_miss', 23), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:14:57.2675196Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.2675720Z graph_break [] 2025-12-04T09:14:57.2676029Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.2677344Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:14:57.2678605Z current_size = base.storage().size() 2025-12-04T09:14:57.2679002Z Autotune Choices Stats: 2025-12-04T09:14:57.2680716Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:57.2682495Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.2683118Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.2683936Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.2685551Z triton_flex_attention_4 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2687979Z triton_flex_attention_5 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2690366Z triton_flex_attention_0 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2692847Z triton_flex_attention_2 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.2695266Z triton_flex_attention_3 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.2697721Z triton_flex_attention_1 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2699293Z SingleProcess AUTOTUNE benchmarking takes 0.2600 seconds and 1.4244 seconds precompiling for 6 choices 2025-12-04T09:14:57.2699790Z Autotune Choices Stats: 2025-12-04T09:14:57.2701425Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:57.2703426Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.2704458Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.2705503Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.2707413Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2709905Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.2712340Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.2714854Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.2717362Z triton_flex_attention_backward_11 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2719867Z triton_flex_attention_backward_12 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.2722379Z triton_flex_attention_backward_13 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.2724863Z triton_flex_attention_backward_10 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.2727392Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.2729856Z triton_flex_attention_backward_15 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2731426Z SingleProcess AUTOTUNE benchmarking takes 0.8717 seconds and 2.2315 seconds precompiling for 13 choices 2025-12-04T09:14:57.2732008Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.2732444Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.2732712Z unimplemented [] 2025-12-04T09:14:57.2733083Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.2733531Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.2735226Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.2736802Z graph_break [] 2025-12-04T09:14:57.2737082Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.2737428Z Autotune Choices Stats: 2025-12-04T09:14:57.2739095Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.2740868Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.2741530Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.2742219Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.2743820Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2746232Z triton_flex_attention_24 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2748638Z triton_flex_attention_19 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2751015Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.2753572Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.2756014Z triton_flex_attention_20 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2757531Z SingleProcess AUTOTUNE benchmarking takes 0.2898 seconds and 1.2778 seconds precompiling for 6 choices 2025-12-04T09:14:57.2758056Z Autotune Choices Stats: 2025-12-04T09:14:57.2759693Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.2761732Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.2763248Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.2764332Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.2766297Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2768780Z triton_flex_attention_backward_27 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.2771219Z triton_flex_attention_backward_28 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.2773994Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.2776564Z triton_flex_attention_backward_30 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2779091Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.2781563Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.2783999Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.2786547Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2789013Z triton_flex_attention_backward_37 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.2790550Z SingleProcess AUTOTUNE benchmarking takes 0.6605 seconds and 2.3199 seconds precompiling for 13 choices 2025-12-04T09:14:57.2791168Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.2791572Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.2791806Z unimplemented [] 2025-12-04T09:14:57.2792166Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.2792625Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.2794264Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.2795718Z graph_break [] 2025-12-04T09:14:57.2796046Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.2796435Z Autotune Choices Stats: 2025-12-04T09:14:57.2798036Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.2799839Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.2800458Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.2801200Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.2802818Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2805250Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2807637Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.2810003Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.2812585Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2815028Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2816578Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2616 seconds precompiling for 6 choices 2025-12-04T09:14:57.2817090Z Autotune Choices Stats: 2025-12-04T09:14:57.2818713Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.2820839Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.2821760Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.2822763Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.2824708Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2827183Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.2829780Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.2832309Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.2834764Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2837261Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.2839713Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.2842190Z triton_flex_attention_backward_50 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.2844683Z triton_flex_attention_backward_53 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2847136Z triton_flex_attention_backward_56 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.2848705Z SingleProcess AUTOTUNE benchmarking takes 0.6631 seconds and 2.2635 seconds precompiling for 13 choices 2025-12-04T09:14:57.2849425Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.2849763Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.2850074Z unimplemented [] 2025-12-04T09:14:57.2850402Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.2850817Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.2852464Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.2853928Z graph_break [] 2025-12-04T09:14:57.2854245Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.2854681Z Autotune Choices Stats: 2025-12-04T09:14:57.2856255Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.2858039Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.2858731Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.2859427Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.2861041Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2863782Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2866186Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.2868747Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.2871328Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2873707Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2875268Z SingleProcess AUTOTUNE benchmarking takes 0.2907 seconds and 1.2500 seconds precompiling for 6 choices 2025-12-04T09:14:57.2875757Z Autotune Choices Stats: 2025-12-04T09:14:57.2877363Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.2879623Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.2880509Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.2881544Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.2883544Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.2886014Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2888727Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.2891215Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.2893761Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.2896292Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2898742Z triton_flex_attention_backward_69 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.2901273Z triton_flex_attention_backward_70 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.2903733Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2906272Z triton_flex_attention_backward_75 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.2907946Z SingleProcess AUTOTUNE benchmarking takes 0.6601 seconds and 2.1469 seconds precompiling for 13 choices 2025-12-04T09:14:57.2908520Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.2908884Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.2909205Z unimplemented [] 2025-12-04T09:14:57.2909481Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.2909930Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.2911515Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.2912912Z graph_break [] 2025-12-04T09:14:57.2913206Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.2913605Z Autotune Choices Stats: 2025-12-04T09:14:57.2915205Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.2916957Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.2917670Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.2918404Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.2920051Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2922470Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2924912Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.2927423Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.2929790Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2932167Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2933825Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2576 seconds precompiling for 6 choices 2025-12-04T09:14:57.2934267Z Autotune Choices Stats: 2025-12-04T09:14:57.2935903Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_84", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.2938047Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.2938978Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.2940061Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.2941954Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.2943285Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.2944521Z triton_flex_attention_backward_83 0.0122 ms 92.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2945776Z triton_flex_attention_backward_82 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.2946957Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2948174Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.2949375Z triton_flex_attention_backward_88 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.2950544Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.2951771Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.2953116Z triton_flex_attention_backward_91 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2953426Z SingleProcess AUTOTUNE benchmarking takes 0.6599 seconds and 2.2799 seconds precompiling for 13 choices 2025-12-04T09:14:57.2953630Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.2953758Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.2953850Z unimplemented [] 2025-12-04T09:14:57.2954032Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.2954307Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.2955511Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.2955629Z graph_break [] 2025-12-04T09:14:57.2955801Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.2955884Z Autotune Choices Stats: 2025-12-04T09:14:57.2957384Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:57.2957661Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.2957937Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.2958280Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.2959474Z triton_flex_attention_99 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2960652Z triton_flex_attention_100 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2962345Z triton_flex_attention_95 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2963598Z triton_flex_attention_98 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.2964756Z triton_flex_attention_97 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.2965912Z triton_flex_attention_96 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2966263Z SingleProcess AUTOTUNE benchmarking takes 0.2905 seconds and 1.2994 seconds precompiling for 6 choices 2025-12-04T09:14:57.2966450Z Autotune Choices Stats: 2025-12-04T09:14:57.2967925Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_102", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.2968420Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.2968783Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.2969439Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.2971382Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2972795Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.2973982Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.2975203Z triton_flex_attention_backward_101 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.2976488Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2977683Z triton_flex_attention_backward_107 0.0133 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.2978892Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.2980073Z triton_flex_attention_backward_108 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.2981374Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.2982679Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2982976Z SingleProcess AUTOTUNE benchmarking takes 0.6615 seconds and 2.2434 seconds precompiling for 13 choices 2025-12-04T09:14:57.2983140Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.2983275Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.2983432Z unimplemented [] 2025-12-04T09:14:57.2983568Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.2983867Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.2985086Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.2985214Z graph_break [] 2025-12-04T09:14:57.2985392Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.2985524Z Autotune Choices Stats: 2025-12-04T09:14:57.2986935Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.2987302Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.2987556Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.2987899Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.2989103Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2990327Z triton_flex_attention_119 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2991578Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.2992782Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.2993919Z triton_flex_attention_114 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2995095Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.2995370Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2587 seconds precompiling for 6 choices 2025-12-04T09:14:57.2995545Z Autotune Choices Stats: 2025-12-04T09:14:57.2997028Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:57.2997561Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.2997918Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.2998531Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.2999853Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.3001159Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3002344Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.3003555Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.3004780Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3005939Z triton_flex_attention_backward_124 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.3007206Z triton_flex_attention_backward_126 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.3008442Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.3009728Z triton_flex_attention_backward_129 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3010936Z triton_flex_attention_backward_132 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.3011202Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.3330 seconds precompiling for 13 choices 2025-12-04T09:14:57.3011492Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:14:57.3011600Z Traceback (most recent call last): 2025-12-04T09:14:57.3011941Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:14:57.3012066Z self.assertTrue( 2025-12-04T09:14:57.3012404Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:14:57.3012505Z raise self.failureException(msg) 2025-12-04T09:14:57.3012872Z AssertionError: False is not true : Log file /tmp/tmpo4q1ub_j/flex_attention_configs.json was not created 2025-12-04T09:14:57.3012877Z 2025-12-04T09:14:57.3013057Z To execute this test, run the following from the base repo dir: 2025-12-04T09:14:57.3013368Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:14:57.3013372Z 2025-12-04T09:14:57.3013562Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:14:57.3013722Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.3013899Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.3014003Z unimplemented [] 2025-12-04T09:14:57.3014182Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.3015397Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('async_compile_cache_miss', 23), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:14:57.3015646Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.3015719Z graph_break [] 2025-12-04T09:14:57.3015940Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.3017019Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:14:57.3017269Z current_size = base.storage().size() 2025-12-04T09:14:57.3017397Z Autotune Choices Stats: 2025-12-04T09:14:57.3018830Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:57.3019186Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.3019457Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.3019811Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.3021031Z triton_flex_attention_4 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3022215Z triton_flex_attention_5 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3023338Z triton_flex_attention_0 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3024580Z triton_flex_attention_2 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.3025718Z triton_flex_attention_3 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.3026949Z triton_flex_attention_1 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3027300Z SingleProcess AUTOTUNE benchmarking takes 0.2600 seconds and 1.4244 seconds precompiling for 6 choices 2025-12-04T09:14:57.3027414Z Autotune Choices Stats: 2025-12-04T09:14:57.3028912Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:57.3029421Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.3029790Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.3030405Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.3031590Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3032882Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.3034073Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.3035288Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.3036625Z triton_flex_attention_backward_11 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3037796Z triton_flex_attention_backward_12 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.3039088Z triton_flex_attention_backward_13 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.3040292Z triton_flex_attention_backward_10 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.3041493Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.3042695Z triton_flex_attention_backward_15 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3042970Z SingleProcess AUTOTUNE benchmarking takes 0.8717 seconds and 2.2315 seconds precompiling for 13 choices 2025-12-04T09:14:57.3043204Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.3043315Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.3043405Z unimplemented [] 2025-12-04T09:14:57.3043568Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.3043779Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.3045090Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.3045286Z graph_break [] 2025-12-04T09:14:57.3045473Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.3045600Z Autotune Choices Stats: 2025-12-04T09:14:57.3047023Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.3047347Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.3047582Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.3048027Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.3049194Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3050420Z triton_flex_attention_24 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3051565Z triton_flex_attention_19 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3052804Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.3054079Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.3055348Z triton_flex_attention_20 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3055654Z SingleProcess AUTOTUNE benchmarking takes 0.2898 seconds and 1.2778 seconds precompiling for 6 choices 2025-12-04T09:14:57.3055749Z Autotune Choices Stats: 2025-12-04T09:14:57.3057211Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.3057737Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.3058120Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.3058751Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.3059942Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3061159Z triton_flex_attention_backward_27 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.3062669Z triton_flex_attention_backward_28 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.3064025Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.3065324Z triton_flex_attention_backward_30 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3066502Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.3067724Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.3069013Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.3070196Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3071404Z triton_flex_attention_backward_37 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.3071697Z SingleProcess AUTOTUNE benchmarking takes 0.6605 seconds and 2.3199 seconds precompiling for 13 choices 2025-12-04T09:14:57.3071916Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.3072001Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.3072153Z unimplemented [] 2025-12-04T09:14:57.3072331Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.3072696Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.3073942Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.3074033Z graph_break [] 2025-12-04T09:14:57.3074275Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.3074386Z Autotune Choices Stats: 2025-12-04T09:14:57.3075810Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.3076134Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.3076380Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.3076745Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.3077948Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3079161Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3080303Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.3081515Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.3082794Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3083988Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3084313Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2616 seconds precompiling for 6 choices 2025-12-04T09:14:57.3084408Z Autotune Choices Stats: 2025-12-04T09:14:57.3085888Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.3086351Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.3086785Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.3087383Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.3088598Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3089778Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.3091090Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.3092378Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.3093695Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3094909Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.3096103Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.3097311Z triton_flex_attention_backward_50 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.3098474Z triton_flex_attention_backward_53 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3099779Z triton_flex_attention_backward_56 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.3100169Z SingleProcess AUTOTUNE benchmarking takes 0.6631 seconds and 2.2635 seconds precompiling for 13 choices 2025-12-04T09:14:57.3100396Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.3100492Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.3100602Z unimplemented [] 2025-12-04T09:14:57.3100794Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.3101045Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.3102288Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.3102381Z graph_break [] 2025-12-04T09:14:57.3102572Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.3102653Z Autotune Choices Stats: 2025-12-04T09:14:57.3104181Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.3104456Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.3104741Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.3105080Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.3106241Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3107454Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3108671Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.3109871Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.3111175Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3112331Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3112687Z SingleProcess AUTOTUNE benchmarking takes 0.2907 seconds and 1.2500 seconds precompiling for 6 choices 2025-12-04T09:14:57.3112794Z Autotune Choices Stats: 2025-12-04T09:14:57.3114277Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.3114742Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.3115151Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.3115728Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.3117007Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.3118197Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3119585Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.3120805Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.3121962Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.3123233Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3124435Z triton_flex_attention_backward_69 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.3125636Z triton_flex_attention_backward_70 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.3126837Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3128058Z triton_flex_attention_backward_75 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.3128549Z SingleProcess AUTOTUNE benchmarking takes 0.6601 seconds and 2.1469 seconds precompiling for 13 choices 2025-12-04T09:14:57.3128713Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.3128811Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.3128934Z unimplemented [] 2025-12-04T09:14:57.3129069Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.3129359Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.3130575Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.3130701Z graph_break [] 2025-12-04T09:14:57.3130857Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.3130952Z Autotune Choices Stats: 2025-12-04T09:14:57.3132462Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.3132787Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.3133081Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.3133424Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.3134614Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3135767Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3137043Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.3138291Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.3139476Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3140707Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3140982Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2576 seconds precompiling for 6 choices 2025-12-04T09:14:57.3141062Z Autotune Choices Stats: 2025-12-04T09:14:57.3142606Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_84", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.3143070Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.3143473Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.3144062Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.3145257Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.3146747Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.3148005Z triton_flex_attention_backward_83 0.0122 ms 92.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3149224Z triton_flex_attention_backward_82 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.3150392Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3151583Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.3152863Z triton_flex_attention_backward_88 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.3154038Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.3155299Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.3156534Z triton_flex_attention_backward_91 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3156843Z SingleProcess AUTOTUNE benchmarking takes 0.6599 seconds and 2.2799 seconds precompiling for 13 choices 2025-12-04T09:14:57.3157065Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.3157246Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.3157339Z unimplemented [] 2025-12-04T09:14:57.3157476Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.3157752Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.3158936Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.3159132Z graph_break [] 2025-12-04T09:14:57.3159289Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.3159383Z Autotune Choices Stats: 2025-12-04T09:14:57.3160851Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:57.3161124Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.3161441Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.3162027Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.3163268Z triton_flex_attention_99 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3164571Z triton_flex_attention_100 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3165848Z triton_flex_attention_95 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3167073Z triton_flex_attention_98 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.3168231Z triton_flex_attention_97 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.3169428Z triton_flex_attention_96 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3169703Z SingleProcess AUTOTUNE benchmarking takes 0.2905 seconds and 1.2994 seconds precompiling for 6 choices 2025-12-04T09:14:57.3169871Z Autotune Choices Stats: 2025-12-04T09:14:57.3171321Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_102", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.3171894Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.3172247Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.3172951Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.3174216Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3175493Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.3176731Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.3177979Z triton_flex_attention_backward_101 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.3179190Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3180369Z triton_flex_attention_backward_107 0.0133 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.3181576Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.3182791Z triton_flex_attention_backward_108 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.3184143Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.3185365Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3185644Z SingleProcess AUTOTUNE benchmarking takes 0.6615 seconds and 2.2434 seconds precompiling for 13 choices 2025-12-04T09:14:57.3185817Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.3185985Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.3186090Z unimplemented [] 2025-12-04T09:14:57.3186312Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.3186531Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.3187730Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.3187842Z graph_break [] 2025-12-04T09:14:57.3188060Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.3188200Z Autotune Choices Stats: 2025-12-04T09:14:57.3189630Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.3189935Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.3190187Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.3190604Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.3191773Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3193030Z triton_flex_attention_119 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3194252Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.3195437Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.3196556Z triton_flex_attention_114 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3197816Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3198085Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2587 seconds precompiling for 6 choices 2025-12-04T09:14:57.3198212Z Autotune Choices Stats: 2025-12-04T09:14:57.3199681Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:57.3200166Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.3200618Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.3201324Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.3202584Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.3203814Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3205012Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.3206231Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.3207461Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3208638Z triton_flex_attention_backward_124 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.3209864Z triton_flex_attention_backward_126 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.3211129Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.3212514Z triton_flex_attention_backward_129 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3213789Z triton_flex_attention_backward_132 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.3214071Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.3330 seconds precompiling for 13 choices 2025-12-04T09:14:57.3214264Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.3214365Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.3214443Z unimplemented [] 2025-12-04T09:14:57.3214693Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.3214911Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.3216198Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.3216288Z graph_break [] 2025-12-04T09:14:57.3216443Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.3216598Z Autotune Choices Stats: 2025-12-04T09:14:57.3218054Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:57.3218368Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.3218616Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.3219002Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.3220224Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3221545Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3222707Z triton_flex_attention_136 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.3223886Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.3225067Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3226192Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3226550Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3488 seconds precompiling for 6 choices 2025-12-04T09:14:57.3226648Z Autotune Choices Stats: 2025-12-04T09:14:57.3228143Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_140", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.3228631Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.3229068Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.3229807Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.3231047Z triton_flex_attention_backward_140 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3232245Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.3233449Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.3234622Z triton_flex_attention_backward_142 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.3235866Z triton_flex_attention_backward_144 0.0131 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3237099Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.3238370Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.3239649Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.3240827Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3242086Z triton_flex_attention_backward_149 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.3242371Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.2392 seconds precompiling for 13 choices 2025-12-04T09:14:57.3242609Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:14:57.3242728Z Traceback (most recent call last): 2025-12-04T09:14:57.3243096Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:14:57.3243177Z self.assertTrue( 2025-12-04T09:14:57.3243466Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:14:57.3243626Z raise self.failureException(msg) 2025-12-04T09:14:57.3243892Z AssertionError: False is not true : Log file /tmp/tmp14xpub2o/flex_attention_configs.json was not created 2025-12-04T09:14:57.3243897Z 2025-12-04T09:14:57.3244130Z To execute this test, run the following from the base repo dir: 2025-12-04T09:14:57.3244412Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:14:57.3244416Z 2025-12-04T09:14:57.3244608Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:14:57.3244864Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.3244964Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.3245053Z unimplemented [] 2025-12-04T09:14:57.3245216Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.3246446Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('async_compile_cache_miss', 23), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:14:57.3246720Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.3246821Z graph_break [] 2025-12-04T09:14:57.3247144Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.3248177Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:14:57.3248293Z current_size = base.storage().size() 2025-12-04T09:14:57.3248416Z Autotune Choices Stats: 2025-12-04T09:14:57.3249885Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:57.3250209Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.3250470Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.3250845Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.3252001Z triton_flex_attention_4 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3253269Z triton_flex_attention_5 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3254461Z triton_flex_attention_0 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3255644Z triton_flex_attention_2 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.3256888Z triton_flex_attention_3 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.3258090Z triton_flex_attention_1 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3258431Z SingleProcess AUTOTUNE benchmarking takes 0.2600 seconds and 1.4244 seconds precompiling for 6 choices 2025-12-04T09:14:57.3258557Z Autotune Choices Stats: 2025-12-04T09:14:57.3260035Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:57.3260500Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.3260896Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.3261473Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.3263027Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3264227Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.3265440Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.3266724Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.3268036Z triton_flex_attention_backward_11 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3269327Z triton_flex_attention_backward_12 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.3270498Z triton_flex_attention_backward_13 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.3271706Z triton_flex_attention_backward_10 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.3272934Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.3274125Z triton_flex_attention_backward_15 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3274457Z SingleProcess AUTOTUNE benchmarking takes 0.8717 seconds and 2.2315 seconds precompiling for 13 choices 2025-12-04T09:14:57.3274667Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.3274764Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.3274977Z unimplemented [] 2025-12-04T09:14:57.3275170Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.3275388Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.3276683Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.3276788Z graph_break [] 2025-12-04T09:14:57.3276976Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.3277073Z Autotune Choices Stats: 2025-12-04T09:14:57.3278500Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.3278847Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.3279127Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.3279504Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.3280666Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3281827Z triton_flex_attention_24 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3283025Z triton_flex_attention_19 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3284295Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.3285522Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.3286658Z triton_flex_attention_20 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3286978Z SingleProcess AUTOTUNE benchmarking takes 0.2898 seconds and 1.2778 seconds precompiling for 6 choices 2025-12-04T09:14:57.3287059Z Autotune Choices Stats: 2025-12-04T09:14:57.3288605Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.3289070Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.3289451Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.3290037Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.3291265Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3292603Z triton_flex_attention_backward_27 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.3293895Z triton_flex_attention_backward_28 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.3295168Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.3296374Z triton_flex_attention_backward_30 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3297566Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.3298784Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.3300011Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.3301206Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3302406Z triton_flex_attention_backward_37 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.3302867Z SingleProcess AUTOTUNE benchmarking takes 0.6605 seconds and 2.3199 seconds precompiling for 13 choices 2025-12-04T09:14:57.3303070Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.3303199Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.3303321Z unimplemented [] 2025-12-04T09:14:57.3303455Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.3303665Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.3304895Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.3305055Z graph_break [] 2025-12-04T09:14:57.3305258Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.3305354Z Autotune Choices Stats: 2025-12-04T09:14:57.3306795Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.3307070Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.3307389Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.3307742Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.3308929Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3310065Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3311239Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.3312451Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.3313777Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3314917Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3315235Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2616 seconds precompiling for 6 choices 2025-12-04T09:14:57.3315329Z Autotune Choices Stats: 2025-12-04T09:14:57.3316797Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.3317300Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.3317696Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.3318283Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.3319544Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3320793Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.3322095Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.3323336Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.3324507Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3325707Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.3326877Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.3328120Z triton_flex_attention_backward_50 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.3329368Z triton_flex_attention_backward_53 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3330602Z triton_flex_attention_backward_56 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.3331011Z SingleProcess AUTOTUNE benchmarking takes 0.6631 seconds and 2.2635 seconds precompiling for 13 choices 2025-12-04T09:14:57.3331182Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.3331350Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.3331455Z unimplemented [] 2025-12-04T09:14:57.3331587Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.3331829Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.3333088Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.3333215Z graph_break [] 2025-12-04T09:14:57.3333417Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.3333525Z Autotune Choices Stats: 2025-12-04T09:14:57.3334968Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.3335240Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.3335532Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.3335860Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.3337104Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3338250Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3339519Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.3340718Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.3341873Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3343107Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3343433Z SingleProcess AUTOTUNE benchmarking takes 0.2907 seconds and 1.2500 seconds precompiling for 6 choices 2025-12-04T09:14:57.3343563Z Autotune Choices Stats: 2025-12-04T09:14:57.3345012Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.3345489Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.3345892Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.3346535Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.3347725Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.3349065Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3350242Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.3351520Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.3352720Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.3353932Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3355129Z triton_flex_attention_backward_69 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.3356300Z triton_flex_attention_backward_70 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.3357611Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3358865Z triton_flex_attention_backward_75 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.3359168Z SingleProcess AUTOTUNE benchmarking takes 0.6601 seconds and 2.1469 seconds precompiling for 13 choices 2025-12-04T09:14:57.3359352Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.3359529Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.3359609Z unimplemented [] 2025-12-04T09:14:57.3359785Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.3360046Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.3361244Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.3361381Z graph_break [] 2025-12-04T09:14:57.3361533Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.3361691Z Autotune Choices Stats: 2025-12-04T09:14:57.3363467Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.3363806Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.3364056Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.3364408Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.3365588Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3366901Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3368224Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.3369403Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.3370539Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3371697Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3372024Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2576 seconds precompiling for 6 choices 2025-12-04T09:14:57.3372183Z Autotune Choices Stats: 2025-12-04T09:14:57.3458584Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_84", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.3459115Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.3459464Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.3460035Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.3461349Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.3463302Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.3464484Z triton_flex_attention_backward_83 0.0122 ms 92.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3465633Z triton_flex_attention_backward_82 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.3466770Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3467924Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.3469066Z triton_flex_attention_backward_88 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.3470206Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.3471575Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.3472732Z triton_flex_attention_backward_91 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3473003Z SingleProcess AUTOTUNE benchmarking takes 0.6599 seconds and 2.2799 seconds precompiling for 13 choices 2025-12-04T09:14:57.3473145Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.3473220Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.3473292Z unimplemented [] 2025-12-04T09:14:57.3473401Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.3473595Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.3474780Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.3474854Z graph_break [] 2025-12-04T09:14:57.3474988Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.3475057Z Autotune Choices Stats: 2025-12-04T09:14:57.3476460Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:57.3476718Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.3476943Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.3477259Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.3478384Z triton_flex_attention_99 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3479651Z triton_flex_attention_100 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3480755Z triton_flex_attention_95 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3481866Z triton_flex_attention_98 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.3482971Z triton_flex_attention_97 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.3484088Z triton_flex_attention_96 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3484339Z SingleProcess AUTOTUNE benchmarking takes 0.2905 seconds and 1.2994 seconds precompiling for 6 choices 2025-12-04T09:14:57.3484406Z Autotune Choices Stats: 2025-12-04T09:14:57.3485841Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_102", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.3486275Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.3486603Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.3487226Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.3488453Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3489599Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.3490743Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.3491893Z triton_flex_attention_backward_101 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.3493110Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3494269Z triton_flex_attention_backward_107 0.0133 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.3495408Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.3496620Z triton_flex_attention_backward_108 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.3497828Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.3498973Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3499227Z SingleProcess AUTOTUNE benchmarking takes 0.6615 seconds and 2.2434 seconds precompiling for 13 choices 2025-12-04T09:14:57.3499358Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.3499429Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.3499499Z unimplemented [] 2025-12-04T09:14:57.3499603Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.3499800Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.3500990Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.3501051Z graph_break [] 2025-12-04T09:14:57.3501183Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.3501249Z Autotune Choices Stats: 2025-12-04T09:14:57.3502639Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.3502885Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.3503108Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.3503421Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.3504611Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3505777Z triton_flex_attention_119 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3506886Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.3507988Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.3509104Z triton_flex_attention_114 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3510211Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3510463Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2587 seconds precompiling for 6 choices 2025-12-04T09:14:57.3510532Z Autotune Choices Stats: 2025-12-04T09:14:57.3511954Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:57.3512386Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.3512867Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.3513425Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.3514595Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.3515737Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3516883Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.3518028Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.3519170Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3520313Z triton_flex_attention_backward_124 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.3521523Z triton_flex_attention_backward_126 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.3522739Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.3523889Z triton_flex_attention_backward_129 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3525024Z triton_flex_attention_backward_132 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.3525274Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.3330 seconds precompiling for 13 choices 2025-12-04T09:14:57.3525408Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.3525477Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.3525547Z unimplemented [] 2025-12-04T09:14:57.3525650Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.3525842Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.3527028Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.3527087Z graph_break [] 2025-12-04T09:14:57.3527226Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.3527293Z Autotune Choices Stats: 2025-12-04T09:14:57.3528695Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:57.3528938Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.3529321Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.3529702Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.3530832Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3531939Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3533129Z triton_flex_attention_136 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.3534230Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.3535346Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3536484Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3536738Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3488 seconds precompiling for 6 choices 2025-12-04T09:14:57.3536805Z Autotune Choices Stats: 2025-12-04T09:14:57.3538323Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_140", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.3538830Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.3539164Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.3539728Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.3540898Z triton_flex_attention_backward_140 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3542049Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.3543206Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.3544348Z triton_flex_attention_backward_142 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.3545493Z triton_flex_attention_backward_144 0.0131 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3546711Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.3547910Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.3549060Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.3550210Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3551359Z triton_flex_attention_backward_149 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.3551614Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.2392 seconds precompiling for 13 choices 2025-12-04T09:14:57.3551746Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.3551816Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.3551884Z unimplemented [] 2025-12-04T09:14:57.3551988Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.3552173Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.3553361Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.3553422Z graph_break [] 2025-12-04T09:14:57.3553554Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.3553620Z Autotune Choices Stats: 2025-12-04T09:14:57.3555095Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.3555449Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.3555675Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.3555990Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.3557119Z triton_flex_attention_156 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3558229Z triton_flex_attention_157 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3559338Z triton_flex_attention_152 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3560452Z triton_flex_attention_154 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.3561564Z triton_flex_attention_155 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.3562987Z triton_flex_attention_153 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3563246Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3309 seconds precompiling for 6 choices 2025-12-04T09:14:57.3563314Z Autotune Choices Stats: 2025-12-04T09:14:57.3564868Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_160", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.3565397Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.3565735Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.3566300Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.3567478Z triton_flex_attention_backward_160 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.3568623Z triton_flex_attention_backward_158 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.3569774Z triton_flex_attention_backward_159 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3570924Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.3572063Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3573434Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.3574579Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.3575733Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.3576878Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3578020Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.3578273Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.1546 seconds precompiling for 13 choices 2025-12-04T09:14:57.3578448Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:14:57.3578527Z Traceback (most recent call last): 2025-12-04T09:14:57.3578836Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:14:57.3578906Z self.assertTrue( 2025-12-04T09:14:57.3579101Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:14:57.3579189Z raise self.failureException(msg) 2025-12-04T09:14:57.3579429Z AssertionError: False is not true : Log file /tmp/tmpy18z1z5_/flex_attention_configs.json was not created 2025-12-04T09:14:57.3579434Z 2025-12-04T09:14:57.3579574Z To execute this test, run the following from the base repo dir: 2025-12-04T09:14:57.3579831Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:14:57.3579836Z 2025-12-04T09:14:57.3579999Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:14:57.3580136Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.3580205Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.3580367Z unimplemented [] 2025-12-04T09:14:57.3580534Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.3581719Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('async_compile_cache_miss', 23), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:14:57.3583177Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.3583510Z graph_break [] 2025-12-04T09:14:57.3583745Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.3584949Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:14:57.3586089Z current_size = base.storage().size() 2025-12-04T09:14:57.3586318Z Autotune Choices Stats: 2025-12-04T09:14:57.3587834Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:57.3589546Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.3590092Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.3590705Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.3592411Z triton_flex_attention_4 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3594707Z triton_flex_attention_5 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3596978Z triton_flex_attention_0 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3599401Z triton_flex_attention_2 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.3601860Z triton_flex_attention_3 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.3604168Z triton_flex_attention_1 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3605588Z SingleProcess AUTOTUNE benchmarking takes 0.2600 seconds and 1.4244 seconds precompiling for 6 choices 2025-12-04T09:14:57.3605984Z Autotune Choices Stats: 2025-12-04T09:14:57.3607543Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:57.3609466Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.3610299Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.3611265Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.3613131Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3615592Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.3618011Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.3620362Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.3622726Z triton_flex_attention_backward_11 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3625076Z triton_flex_attention_backward_12 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.3627422Z triton_flex_attention_backward_13 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.3629770Z triton_flex_attention_backward_10 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.3632125Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.3634627Z triton_flex_attention_backward_15 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3636086Z SingleProcess AUTOTUNE benchmarking takes 0.8717 seconds and 2.2315 seconds precompiling for 13 choices 2025-12-04T09:14:57.3636557Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.3636840Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.3637037Z unimplemented [] 2025-12-04T09:14:57.3637249Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.3637613Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.3639057Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.3640358Z graph_break [] 2025-12-04T09:14:57.3640595Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.3640871Z Autotune Choices Stats: 2025-12-04T09:14:57.3642371Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.3644054Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.3644593Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.3645206Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.3646733Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3649019Z triton_flex_attention_24 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3651365Z triton_flex_attention_19 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3653807Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.3656102Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.3658386Z triton_flex_attention_20 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3659821Z SingleProcess AUTOTUNE benchmarking takes 0.2898 seconds and 1.2778 seconds precompiling for 6 choices 2025-12-04T09:14:57.3660214Z Autotune Choices Stats: 2025-12-04T09:14:57.3661756Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.3663858Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.3664705Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.3665668Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.3667570Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3670056Z triton_flex_attention_backward_27 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.3672415Z triton_flex_attention_backward_28 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.3674759Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.3677107Z triton_flex_attention_backward_30 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3679452Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.3681807Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.3684173Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.3686593Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3689006Z triton_flex_attention_backward_37 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.3690465Z SingleProcess AUTOTUNE benchmarking takes 0.6605 seconds and 2.3199 seconds precompiling for 13 choices 2025-12-04T09:14:57.3690933Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.3691218Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.3691416Z unimplemented [] 2025-12-04T09:14:57.3691624Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.3691985Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.3693480Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.3694779Z graph_break [] 2025-12-04T09:14:57.3695007Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.3695279Z Autotune Choices Stats: 2025-12-04T09:14:57.3696776Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.3698468Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.3699009Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.3699624Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.3701141Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3703498Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3705865Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.3708146Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.3710438Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3712720Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3714147Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2616 seconds precompiling for 6 choices 2025-12-04T09:14:57.3714528Z Autotune Choices Stats: 2025-12-04T09:14:57.3716061Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.3717975Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.3718807Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.3719767Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.3721639Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3724084Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.3726446Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.3728798Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.3731145Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3733558Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.3735908Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.3738326Z triton_flex_attention_backward_50 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.3740761Z triton_flex_attention_backward_53 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3743111Z triton_flex_attention_backward_56 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.3744567Z SingleProcess AUTOTUNE benchmarking takes 0.6631 seconds and 2.2635 seconds precompiling for 13 choices 2025-12-04T09:14:57.3745018Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.3745298Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.3745486Z unimplemented [] 2025-12-04T09:14:57.3745688Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.3746049Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.3747483Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.3748786Z graph_break [] 2025-12-04T09:14:57.3749021Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.3749302Z Autotune Choices Stats: 2025-12-04T09:14:57.3750799Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.3752484Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.3753014Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.3753622Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.3755206Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3757557Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3759836Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.3762309Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.3764608Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3766885Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3768314Z SingleProcess AUTOTUNE benchmarking takes 0.2907 seconds and 1.2500 seconds precompiling for 6 choices 2025-12-04T09:14:57.3768714Z Autotune Choices Stats: 2025-12-04T09:14:57.3770252Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.3772181Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.3773056Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.3774263Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.3776250Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.3778640Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3781012Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.3783376Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.3785744Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.3788106Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3790463Z triton_flex_attention_backward_69 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.3792965Z triton_flex_attention_backward_70 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.3795317Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3797664Z triton_flex_attention_backward_75 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.3799126Z SingleProcess AUTOTUNE benchmarking takes 0.6601 seconds and 2.1469 seconds precompiling for 13 choices 2025-12-04T09:14:57.3799602Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.3799886Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.3800080Z unimplemented [] 2025-12-04T09:14:57.3800283Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.3800654Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.3802101Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.3803397Z graph_break [] 2025-12-04T09:14:57.3803616Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.3803904Z Autotune Choices Stats: 2025-12-04T09:14:57.3805406Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.3807088Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.3807621Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.3808225Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.3809923Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3812211Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3814566Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.3816847Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.3819130Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3821403Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3822825Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2576 seconds precompiling for 6 choices 2025-12-04T09:14:57.3823217Z Autotune Choices Stats: 2025-12-04T09:14:57.3824750Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_84", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.3826737Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.3827635Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.3828600Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.3830394Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.3832765Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.3835122Z triton_flex_attention_backward_83 0.0122 ms 92.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3837475Z triton_flex_attention_backward_82 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.3839841Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3842190Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.3844627Z triton_flex_attention_backward_88 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.3847043Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.3849392Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.3851733Z triton_flex_attention_backward_91 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3853279Z SingleProcess AUTOTUNE benchmarking takes 0.6599 seconds and 2.2799 seconds precompiling for 13 choices 2025-12-04T09:14:57.3853732Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.3854019Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.3854216Z unimplemented [] 2025-12-04T09:14:57.3854420Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.3854775Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.3856228Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.3857521Z graph_break [] 2025-12-04T09:14:57.3857745Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.3858015Z Autotune Choices Stats: 2025-12-04T09:14:57.3859509Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:57.3861204Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.3862619Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.3863342Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.3864894Z triton_flex_attention_99 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3867220Z triton_flex_attention_100 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3869518Z triton_flex_attention_95 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3871816Z triton_flex_attention_98 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.3874105Z triton_flex_attention_97 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.3876393Z triton_flex_attention_96 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3877824Z SingleProcess AUTOTUNE benchmarking takes 0.2905 seconds and 1.2994 seconds precompiling for 6 choices 2025-12-04T09:14:57.3878210Z Autotune Choices Stats: 2025-12-04T09:14:57.3879836Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_102", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.3881874Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.3882707Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.3883669Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.3885481Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3887867Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.3890260Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.3892722Z triton_flex_attention_backward_101 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.3895109Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3897548Z triton_flex_attention_backward_107 0.0133 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.3899994Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.3902362Z triton_flex_attention_backward_108 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.3904727Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.3907100Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3908577Z SingleProcess AUTOTUNE benchmarking takes 0.6615 seconds and 2.2434 seconds precompiling for 13 choices 2025-12-04T09:14:57.3909035Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.3909322Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.3909515Z unimplemented [] 2025-12-04T09:14:57.3909723Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.3910089Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.3911525Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.3912827Z graph_break [] 2025-12-04T09:14:57.3913052Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.3913332Z Autotune Choices Stats: 2025-12-04T09:14:57.3914912Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.3916674Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.3917219Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.3917833Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.3919368Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3921678Z triton_flex_attention_119 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3923979Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.3926272Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.3928563Z triton_flex_attention_114 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3930858Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3932339Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2587 seconds precompiling for 6 choices 2025-12-04T09:14:57.3932759Z Autotune Choices Stats: 2025-12-04T09:14:57.3934457Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:57.3936381Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.3937207Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.3938176Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.3939995Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.3942385Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3944778Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.3947150Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.3949528Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3952069Z triton_flex_attention_backward_124 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.3954432Z triton_flex_attention_backward_126 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.3956789Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.3959164Z triton_flex_attention_backward_129 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3961539Z triton_flex_attention_backward_132 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.3963247Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.3330 seconds precompiling for 13 choices 2025-12-04T09:14:57.3963709Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.3963996Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.3964190Z unimplemented [] 2025-12-04T09:14:57.3964394Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.3964759Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.3966196Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.3967494Z graph_break [] 2025-12-04T09:14:57.3967715Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.3967996Z Autotune Choices Stats: 2025-12-04T09:14:57.3969710Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:57.3971406Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.3971947Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.3972641Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.3974170Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3976474Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3978918Z triton_flex_attention_136 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.3981205Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.3983499Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3985851Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3987389Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3488 seconds precompiling for 6 choices 2025-12-04T09:14:57.3987780Z Autotune Choices Stats: 2025-12-04T09:14:57.3989330Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_140", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.3991252Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.3992084Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.3993051Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.3994852Z triton_flex_attention_backward_140 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.3997235Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.3999620Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4001986Z triton_flex_attention_backward_142 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4004412Z triton_flex_attention_backward_144 0.0131 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4006842Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4009214Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4011571Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4014021Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4016383Z triton_flex_attention_backward_149 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4017849Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.2392 seconds precompiling for 13 choices 2025-12-04T09:14:57.4018303Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.4018580Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.4018779Z unimplemented [] 2025-12-04T09:14:57.4018986Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.4019345Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.4020861Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.4022236Z graph_break [] 2025-12-04T09:14:57.4022463Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.4022734Z Autotune Choices Stats: 2025-12-04T09:14:57.4024235Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.4025932Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4026472Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4027093Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4028614Z triton_flex_attention_156 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4030923Z triton_flex_attention_157 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4033208Z triton_flex_attention_152 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4035502Z triton_flex_attention_154 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.4037790Z triton_flex_attention_155 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.4040210Z triton_flex_attention_153 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4041640Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3309 seconds precompiling for 6 choices 2025-12-04T09:14:57.4042019Z Autotune Choices Stats: 2025-12-04T09:14:57.4043559Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_160", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.4045473Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4046303Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4047275Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4049076Z triton_flex_attention_backward_160 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4051452Z triton_flex_attention_backward_158 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4053891Z triton_flex_attention_backward_159 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4056333Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4058802Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4061168Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4063785Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4066174Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4068550Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4070927Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.4072399Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.1546 seconds precompiling for 13 choices 2025-12-04T09:14:57.4072862Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.4073154Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.4073348Z unimplemented [] 2025-12-04T09:14:57.4073558Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.4073923Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.4075565Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.4076860Z graph_break [] 2025-12-04T09:14:57.4077085Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.4077373Z Autotune Choices Stats: 2025-12-04T09:14:57.4078883Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.4080573Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4081105Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4081716Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4083241Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4085538Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4087823Z triton_flex_attention_171 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4090119Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.4092535Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.4094907Z triton_flex_attention_172 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4096342Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3199 seconds precompiling for 6 choices 2025-12-04T09:14:57.4096733Z Autotune Choices Stats: 2025-12-04T09:14:57.4098268Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_178", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.4100191Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4101026Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4101991Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4103797Z triton_flex_attention_backward_178 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4106178Z triton_flex_attention_backward_177 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4108620Z triton_flex_attention_backward_180 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4111055Z triton_flex_attention_backward_179 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4113427Z triton_flex_attention_backward_182 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4115792Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4118162Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4120535Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4122896Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4125256Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.4126722Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3449 seconds precompiling for 13 choices 2025-12-04T09:14:57.4127433Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:14:57.4127772Z Traceback (most recent call last): 2025-12-04T09:14:57.4128215Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:14:57.4128651Z self.assertTrue( 2025-12-04T09:14:57.4128976Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:14:57.4129329Z raise self.failureException(msg) 2025-12-04T09:14:57.4129718Z AssertionError: False is not true : Log file /tmp/tmpc046nj78/flex_attention_configs.json was not created 2025-12-04T09:14:57.4130033Z 2025-12-04T09:14:57.4130167Z To execute this test, run the following from the base repo dir: 2025-12-04T09:14:57.4130627Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:14:57.4130954Z 2025-12-04T09:14:57.4131125Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:14:57.4131500Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.4131776Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.4131967Z unimplemented [] 2025-12-04T09:14:57.4132173Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.4133583Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('async_compile_cache_miss', 23), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:14:57.4135017Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.4135350Z graph_break [] 2025-12-04T09:14:57.4135579Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.4136775Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:14:57.4137896Z current_size = base.storage().size() 2025-12-04T09:14:57.4138116Z Autotune Choices Stats: 2025-12-04T09:14:57.4139624Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:57.4141315Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4141852Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4142460Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4144063Z triton_flex_attention_4 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4146423Z triton_flex_attention_5 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4148711Z triton_flex_attention_0 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4151004Z triton_flex_attention_2 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.4153290Z triton_flex_attention_3 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.4155581Z triton_flex_attention_1 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4157005Z SingleProcess AUTOTUNE benchmarking takes 0.2600 seconds and 1.4244 seconds precompiling for 6 choices 2025-12-04T09:14:57.4157393Z Autotune Choices Stats: 2025-12-04T09:14:57.4158937Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:57.4160855Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4161757Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4163432Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4165226Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4167570Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4169936Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4172261Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4174644Z triton_flex_attention_backward_11 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4177019Z triton_flex_attention_backward_12 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4179495Z triton_flex_attention_backward_13 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4181942Z triton_flex_attention_backward_10 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4184301Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4186659Z triton_flex_attention_backward_15 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4188115Z SingleProcess AUTOTUNE benchmarking takes 0.8717 seconds and 2.2315 seconds precompiling for 13 choices 2025-12-04T09:14:57.4188580Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.4188871Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.4189074Z unimplemented [] 2025-12-04T09:14:57.4189298Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.4189658Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.4191100Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.4192398Z graph_break [] 2025-12-04T09:14:57.4192621Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.4192896Z Autotune Choices Stats: 2025-12-04T09:14:57.4194406Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.4196074Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4196622Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4197341Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4198919Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4201198Z triton_flex_attention_24 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4203464Z triton_flex_attention_19 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4205737Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.4208002Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.4210282Z triton_flex_attention_20 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4211720Z SingleProcess AUTOTUNE benchmarking takes 0.2898 seconds and 1.2778 seconds precompiling for 6 choices 2025-12-04T09:14:57.4212104Z Autotune Choices Stats: 2025-12-04T09:14:57.4213720Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.4215784Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4216619Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4217588Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4219402Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4221774Z triton_flex_attention_backward_27 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4224138Z triton_flex_attention_backward_28 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4226488Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4228840Z triton_flex_attention_backward_30 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4231192Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4233671Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4236030Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4238389Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4240763Z triton_flex_attention_backward_37 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.4242239Z SingleProcess AUTOTUNE benchmarking takes 0.6605 seconds and 2.3199 seconds precompiling for 13 choices 2025-12-04T09:14:57.4242700Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.4242983Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.4243173Z unimplemented [] 2025-12-04T09:14:57.4243382Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.4243746Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.4245185Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.4246488Z graph_break [] 2025-12-04T09:14:57.4246718Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.4246997Z Autotune Choices Stats: 2025-12-04T09:14:57.4248502Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.4250344Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4250880Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4251524Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4253097Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4255389Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4257669Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.4259953Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.4262395Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4264682Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4266107Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2616 seconds precompiling for 6 choices 2025-12-04T09:14:57.4266492Z Autotune Choices Stats: 2025-12-04T09:14:57.4268130Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.4270153Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4270978Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4271936Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4273739Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4276106Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4278471Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4280829Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4283184Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4285607Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4288023Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4290377Z triton_flex_attention_backward_50 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4292768Z triton_flex_attention_backward_53 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4295131Z triton_flex_attention_backward_56 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.4296585Z SingleProcess AUTOTUNE benchmarking takes 0.6631 seconds and 2.2635 seconds precompiling for 13 choices 2025-12-04T09:14:57.4297044Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.4297328Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.4297531Z unimplemented [] 2025-12-04T09:14:57.4297741Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.4298108Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.4299555Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.4300850Z graph_break [] 2025-12-04T09:14:57.4301071Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.4301353Z Autotune Choices Stats: 2025-12-04T09:14:57.4302930Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.4304690Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4305233Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4305846Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4307369Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4309671Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4311974Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.4314268Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.4316559Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4318848Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4320269Z SingleProcess AUTOTUNE benchmarking takes 0.2907 seconds and 1.2500 seconds precompiling for 6 choices 2025-12-04T09:14:57.4320804Z Autotune Choices Stats: 2025-12-04T09:14:57.4322344Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.4324265Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4325098Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4325661Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4326830Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4327982Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4329137Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4330283Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4331486Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4332769Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4333912Z triton_flex_attention_backward_69 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4335076Z triton_flex_attention_backward_70 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4336215Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4337357Z triton_flex_attention_backward_75 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.4337608Z SingleProcess AUTOTUNE benchmarking takes 0.6601 seconds and 2.1469 seconds precompiling for 13 choices 2025-12-04T09:14:57.4337737Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.4337811Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.4337879Z unimplemented [] 2025-12-04T09:14:57.4337981Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.4338165Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.4339348Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.4339408Z graph_break [] 2025-12-04T09:14:57.4339540Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.4339738Z Autotune Choices Stats: 2025-12-04T09:14:57.4341141Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.4341385Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4341606Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4341924Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4343054Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4344170Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4345281Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.4346379Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.4347481Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4348656Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4348971Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2576 seconds precompiling for 6 choices 2025-12-04T09:14:57.4349038Z Autotune Choices Stats: 2025-12-04T09:14:57.4350466Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_84", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.4350899Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4351229Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4351790Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4352943Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4354086Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4355230Z triton_flex_attention_backward_83 0.0122 ms 92.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4356366Z triton_flex_attention_backward_82 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4357650Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4358801Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4359944Z triton_flex_attention_backward_88 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4361087Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4362382Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4363545Z triton_flex_attention_backward_91 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4363803Z SingleProcess AUTOTUNE benchmarking takes 0.6599 seconds and 2.2799 seconds precompiling for 13 choices 2025-12-04T09:14:57.4363936Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.4364007Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.4364077Z unimplemented [] 2025-12-04T09:14:57.4364184Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.4364372Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.4365704Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.4365858Z graph_break [] 2025-12-04T09:14:57.4365997Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.4366070Z Autotune Choices Stats: 2025-12-04T09:14:57.4367472Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:57.4367720Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4367937Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4368260Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4369386Z triton_flex_attention_99 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4370516Z triton_flex_attention_100 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4371624Z triton_flex_attention_95 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4372767Z triton_flex_attention_98 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.4373947Z triton_flex_attention_97 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.4375106Z triton_flex_attention_96 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4375365Z SingleProcess AUTOTUNE benchmarking takes 0.2905 seconds and 1.2994 seconds precompiling for 6 choices 2025-12-04T09:14:57.4375434Z Autotune Choices Stats: 2025-12-04T09:14:57.4376863Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_102", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.4377300Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4377630Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4378194Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4379379Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4380537Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4381694Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4382900Z triton_flex_attention_backward_101 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4384107Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4385259Z triton_flex_attention_backward_107 0.0133 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4386403Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4387550Z triton_flex_attention_backward_108 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4388694Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4389845Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4390098Z SingleProcess AUTOTUNE benchmarking takes 0.6615 seconds and 2.2434 seconds precompiling for 13 choices 2025-12-04T09:14:57.4390226Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.4390296Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.4390364Z unimplemented [] 2025-12-04T09:14:57.4390465Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.4390792Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.4391970Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.4392029Z graph_break [] 2025-12-04T09:14:57.4392159Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.4392225Z Autotune Choices Stats: 2025-12-04T09:14:57.4393612Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.4393860Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4394076Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4394395Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4395518Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4396625Z triton_flex_attention_119 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4397734Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.4398835Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.4400064Z triton_flex_attention_114 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4401167Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4401420Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2587 seconds precompiling for 6 choices 2025-12-04T09:14:57.4401487Z Autotune Choices Stats: 2025-12-04T09:14:57.4402907Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:57.4403340Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4403674Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4404234Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4405397Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4406547Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4407765Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4408967Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4410113Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4411261Z triton_flex_attention_backward_124 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4412479Z triton_flex_attention_backward_126 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4413641Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4414780Z triton_flex_attention_backward_129 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4415924Z triton_flex_attention_backward_132 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.4416169Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.3330 seconds precompiling for 13 choices 2025-12-04T09:14:57.4416457Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.4416529Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.4416597Z unimplemented [] 2025-12-04T09:14:57.4416699Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.4416883Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.4418055Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.4418117Z graph_break [] 2025-12-04T09:14:57.4418253Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.4418319Z Autotune Choices Stats: 2025-12-04T09:14:57.4419713Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:57.4419956Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4420180Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4420498Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4421613Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4422724Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4423832Z triton_flex_attention_136 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.4424995Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.4426164Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4427264Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4427519Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3488 seconds precompiling for 6 choices 2025-12-04T09:14:57.4427587Z Autotune Choices Stats: 2025-12-04T09:14:57.4429035Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_140", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.4429471Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4429798Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4430350Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4431516Z triton_flex_attention_backward_140 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4432665Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4440262Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4441505Z triton_flex_attention_backward_142 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4442686Z triton_flex_attention_backward_144 0.0131 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4443837Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4444974Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4446111Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4447239Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4448440Z triton_flex_attention_backward_149 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4448775Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.2392 seconds precompiling for 13 choices 2025-12-04T09:14:57.4448925Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.4449002Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.4449069Z unimplemented [] 2025-12-04T09:14:57.4449188Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.4449385Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.4450581Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.4450648Z graph_break [] 2025-12-04T09:14:57.4450790Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.4450859Z Autotune Choices Stats: 2025-12-04T09:14:57.4452267Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.4452602Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4452825Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4453159Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4454289Z triton_flex_attention_156 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4455400Z triton_flex_attention_157 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4456571Z triton_flex_attention_152 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4457771Z triton_flex_attention_154 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.4458890Z triton_flex_attention_155 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.4459992Z triton_flex_attention_153 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4460246Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3309 seconds precompiling for 6 choices 2025-12-04T09:14:57.4460313Z Autotune Choices Stats: 2025-12-04T09:14:57.4461747Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_160", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.4462347Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4462685Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4463247Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4464417Z triton_flex_attention_backward_160 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4465678Z triton_flex_attention_backward_158 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4466920Z triton_flex_attention_backward_159 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4468068Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4469227Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4470374Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4471516Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4472657Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4473865Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4475069Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.4475318Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.1546 seconds precompiling for 13 choices 2025-12-04T09:14:57.4475457Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.4475526Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.4475591Z unimplemented [] 2025-12-04T09:14:57.4475704Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.4475891Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.4477071Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.4477143Z graph_break [] 2025-12-04T09:14:57.4477273Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.4477350Z Autotune Choices Stats: 2025-12-04T09:14:57.4478739Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.4478995Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4479215Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4479535Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4480655Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4481828Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4482989Z triton_flex_attention_171 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4484093Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.4485202Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.4486301Z triton_flex_attention_172 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4486553Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3199 seconds precompiling for 6 choices 2025-12-04T09:14:57.4486619Z Autotune Choices Stats: 2025-12-04T09:14:57.4488039Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_178", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.4488484Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4488807Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4489369Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4490610Z triton_flex_attention_backward_178 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4491826Z triton_flex_attention_backward_177 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4493025Z triton_flex_attention_backward_180 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4494162Z triton_flex_attention_backward_179 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4495313Z triton_flex_attention_backward_182 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4496450Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4497600Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4498744Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4500021Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4501176Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.4501424Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3449 seconds precompiling for 13 choices 2025-12-04T09:14:57.4501560Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.4501631Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.4501694Z unimplemented [] 2025-12-04T09:14:57.4501807Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.4501991Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.4503173Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.4503236Z graph_break [] 2025-12-04T09:14:57.4503365Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.4503441Z Autotune Choices Stats: 2025-12-04T09:14:57.4504818Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:57.4505072Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4505292Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4505615Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4506731Z triton_flex_attention_194 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4507941Z triton_flex_attention_195 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4509110Z triton_flex_attention_192 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.4510228Z triton_flex_attention_193 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.4511327Z triton_flex_attention_190 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4512434Z triton_flex_attention_191 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4512683Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3094 seconds precompiling for 6 choices 2025-12-04T09:14:57.4512750Z Autotune Choices Stats: 2025-12-04T09:14:57.4514170Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.4514612Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4514938Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4515571Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4516805Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4517952Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4519102Z triton_flex_attention_backward_197 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4520238Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4521392Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4522526Z triton_flex_attention_backward_200 0.0124 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4523668Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4524896Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4526100Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.4527238Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4527485Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.3930 seconds precompiling for 13 choices 2025-12-04T09:14:57.4527661Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:14:57.4527739Z Traceback (most recent call last): 2025-12-04T09:14:57.4528042Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:14:57.4528116Z self.assertTrue( 2025-12-04T09:14:57.4528310Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:14:57.4528401Z raise self.failureException(msg) 2025-12-04T09:14:57.4528649Z AssertionError: False is not true : Log file /tmp/tmpv3f897i8/flex_attention_configs.json was not created 2025-12-04T09:14:57.4528654Z 2025-12-04T09:14:57.4528791Z To execute this test, run the following from the base repo dir: 2025-12-04T09:14:57.4529049Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:14:57.4529053Z 2025-12-04T09:14:57.4529215Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:14:57.4529350Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.4529419Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.4529482Z unimplemented [] 2025-12-04T09:14:57.4529592Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.4530784Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('async_compile_cache_miss', 23), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:14:57.4530977Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.4531039Z graph_break [] 2025-12-04T09:14:57.4531167Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.4532227Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:14:57.4532847Z current_size = base.storage().size() 2025-12-04T09:14:57.4532927Z Autotune Choices Stats: 2025-12-04T09:14:57.4534344Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:57.4534602Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4534826Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4535147Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4536281Z triton_flex_attention_4 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4537382Z triton_flex_attention_5 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4538508Z triton_flex_attention_0 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4539614Z triton_flex_attention_2 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.4540714Z triton_flex_attention_3 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.4541880Z triton_flex_attention_1 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4542193Z SingleProcess AUTOTUNE benchmarking takes 0.2600 seconds and 1.4244 seconds precompiling for 6 choices 2025-12-04T09:14:57.4542270Z Autotune Choices Stats: 2025-12-04T09:14:57.4543693Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:57.4544134Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4544461Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4545027Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4546183Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4547328Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4548472Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4549670Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4550894Z triton_flex_attention_backward_11 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4552029Z triton_flex_attention_backward_12 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4553172Z triton_flex_attention_backward_13 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4554319Z triton_flex_attention_backward_10 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4555462Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4556593Z triton_flex_attention_backward_15 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4556845Z SingleProcess AUTOTUNE benchmarking takes 0.8717 seconds and 2.2315 seconds precompiling for 13 choices 2025-12-04T09:14:57.4556977Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.4557057Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.4557121Z unimplemented [] 2025-12-04T09:14:57.4557226Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.4557415Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.4558653Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.4558788Z graph_break [] 2025-12-04T09:14:57.4558918Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.4558992Z Autotune Choices Stats: 2025-12-04T09:14:57.4560383Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.4560638Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4560856Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4561168Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4562480Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4563618Z triton_flex_attention_24 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4564736Z triton_flex_attention_19 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4565850Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.4567058Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.4568257Z triton_flex_attention_20 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4568508Z SingleProcess AUTOTUNE benchmarking takes 0.2898 seconds and 1.2778 seconds precompiling for 6 choices 2025-12-04T09:14:57.4568582Z Autotune Choices Stats: 2025-12-04T09:14:57.4570015Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.4570461Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4570804Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4571369Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4572574Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4573732Z triton_flex_attention_backward_27 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4574872Z triton_flex_attention_backward_28 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4576083Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4577279Z triton_flex_attention_backward_30 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4578422Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4579562Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4580699Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4581842Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4582985Z triton_flex_attention_backward_37 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.4583236Z SingleProcess AUTOTUNE benchmarking takes 0.6605 seconds and 2.3199 seconds precompiling for 13 choices 2025-12-04T09:14:57.4583368Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.4583449Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.4583587Z unimplemented [] 2025-12-04T09:14:57.4583781Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.4583982Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.4585157Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.4585227Z graph_break [] 2025-12-04T09:14:57.4585358Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.4585425Z Autotune Choices Stats: 2025-12-04T09:14:57.4586827Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.4587073Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4587296Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4587609Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4588741Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4589847Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4590956Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.4592264Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.4593437Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4594550Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4594798Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2616 seconds precompiling for 6 choices 2025-12-04T09:14:57.4594871Z Autotune Choices Stats: 2025-12-04T09:14:57.4596297Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.4596738Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4597064Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4597629Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4598788Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4599941Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4601145Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4602349Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4603489Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4604631Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4605775Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4606915Z triton_flex_attention_backward_50 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4608067Z triton_flex_attention_backward_53 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4609205Z triton_flex_attention_backward_56 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.4609581Z SingleProcess AUTOTUNE benchmarking takes 0.6631 seconds and 2.2635 seconds precompiling for 13 choices 2025-12-04T09:14:57.4609713Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.4609789Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.4609853Z unimplemented [] 2025-12-04T09:14:57.4609955Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.4610147Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.4611324Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.4611394Z graph_break [] 2025-12-04T09:14:57.4611522Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.4611589Z Autotune Choices Stats: 2025-12-04T09:14:57.4613041Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.4613286Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4613512Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4613826Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4614948Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4616054Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4617169Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.4618363Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.4619523Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4620637Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4620881Z SingleProcess AUTOTUNE benchmarking takes 0.2907 seconds and 1.2500 seconds precompiling for 6 choices 2025-12-04T09:14:57.4620954Z Autotune Choices Stats: 2025-12-04T09:14:57.4622372Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.4622811Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4623137Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4623702Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4624861Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4626079Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4627283Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4628447Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4629593Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4630742Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4631886Z triton_flex_attention_backward_69 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4633025Z triton_flex_attention_backward_70 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4634173Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4635370Z triton_flex_attention_backward_75 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.4635749Z SingleProcess AUTOTUNE benchmarking takes 0.6601 seconds and 2.1469 seconds precompiling for 13 choices 2025-12-04T09:14:57.4635877Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.4635952Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.4636015Z unimplemented [] 2025-12-04T09:14:57.4636116Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.4636306Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.4637473Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.4637543Z graph_break [] 2025-12-04T09:14:57.4637668Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.4637743Z Autotune Choices Stats: 2025-12-04T09:14:57.4639140Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.4639390Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4639617Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4639937Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4641065Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4642171Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4643342Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.4644507Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.4645611Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4646718Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4646963Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2576 seconds precompiling for 6 choices 2025-12-04T09:14:57.4647039Z Autotune Choices Stats: 2025-12-04T09:14:57.4648484Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_84", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.4648922Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4649253Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4649817Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4650975Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4652283Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4653507Z triton_flex_attention_backward_83 0.0122 ms 92.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4654652Z triton_flex_attention_backward_82 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4655790Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4656942Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4658090Z triton_flex_attention_backward_88 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4659235Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4660443Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4661642Z triton_flex_attention_backward_91 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4662054Z SingleProcess AUTOTUNE benchmarking takes 0.6599 seconds and 2.2799 seconds precompiling for 13 choices 2025-12-04T09:14:57.4662200Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.4662284Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.4662350Z unimplemented [] 2025-12-04T09:14:57.4662460Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.4662660Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.4663834Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.4663901Z graph_break [] 2025-12-04T09:14:57.4664032Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.4664100Z Autotune Choices Stats: 2025-12-04T09:14:57.4665500Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:57.4665742Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4665967Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4666290Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4667416Z triton_flex_attention_99 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4668635Z triton_flex_attention_100 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4669838Z triton_flex_attention_95 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4670952Z triton_flex_attention_98 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.4672065Z triton_flex_attention_97 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.4673176Z triton_flex_attention_96 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4673424Z SingleProcess AUTOTUNE benchmarking takes 0.2905 seconds and 1.2994 seconds precompiling for 6 choices 2025-12-04T09:14:57.4673490Z Autotune Choices Stats: 2025-12-04T09:14:57.4674916Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_102", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.4675361Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4675685Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4676256Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4677487Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4678721Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4679869Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4681018Z triton_flex_attention_backward_101 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4682165Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4683316Z triton_flex_attention_backward_107 0.0133 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4684468Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4685677Z triton_flex_attention_backward_108 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4686888Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4688037Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4688289Z SingleProcess AUTOTUNE benchmarking takes 0.6615 seconds and 2.2434 seconds precompiling for 13 choices 2025-12-04T09:14:57.4688419Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.4688496Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.4688560Z unimplemented [] 2025-12-04T09:14:57.4688665Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.4688858Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.4690039Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.4690108Z graph_break [] 2025-12-04T09:14:57.4690234Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.4690304Z Autotune Choices Stats: 2025-12-04T09:14:57.4691687Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.4691938Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4692170Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4692524Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4693666Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4694953Z triton_flex_attention_119 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4696067Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.4697187Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.4698288Z triton_flex_attention_114 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4699406Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4699648Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2587 seconds precompiling for 6 choices 2025-12-04T09:14:57.4699716Z Autotune Choices Stats: 2025-12-04T09:14:57.4701140Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:57.4701578Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4701900Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4702527Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4703750Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4704906Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4706052Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4707201Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4708345Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4709492Z triton_flex_attention_backward_124 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4710635Z triton_flex_attention_backward_126 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4711838Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4713067Z triton_flex_attention_backward_129 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4714222Z triton_flex_attention_backward_132 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.4719334Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.3330 seconds precompiling for 13 choices 2025-12-04T09:14:57.4719486Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.4719560Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.4719627Z unimplemented [] 2025-12-04T09:14:57.4719748Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.4719956Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.4721138Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.4721230Z graph_break [] 2025-12-04T09:14:57.4721368Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.4721439Z Autotune Choices Stats: 2025-12-04T09:14:57.4722838Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:57.4723100Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4723319Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4723633Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4724848Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4725978Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4727074Z triton_flex_attention_136 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.4728170Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.4729370Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4730487Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4730747Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3488 seconds precompiling for 6 choices 2025-12-04T09:14:57.4730816Z Autotune Choices Stats: 2025-12-04T09:14:57.4732246Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_140", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.4732788Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4733226Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4733777Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4734958Z triton_flex_attention_backward_140 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4736103Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4737294Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4738436Z triton_flex_attention_backward_142 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4739591Z triton_flex_attention_backward_144 0.0131 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4740737Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4741932Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4743102Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4744237Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4745379Z triton_flex_attention_backward_149 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4745666Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.2392 seconds precompiling for 13 choices 2025-12-04T09:14:57.4745811Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.4745884Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.4745946Z unimplemented [] 2025-12-04T09:14:57.4746056Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.4746244Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.4747427Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.4747489Z graph_break [] 2025-12-04T09:14:57.4747625Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.4747691Z Autotune Choices Stats: 2025-12-04T09:14:57.4749066Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.4749319Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4749621Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4749969Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4751077Z triton_flex_attention_156 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4752178Z triton_flex_attention_157 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4753276Z triton_flex_attention_152 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4754414Z triton_flex_attention_154 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.4755512Z triton_flex_attention_155 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.4756609Z triton_flex_attention_153 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4756858Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3309 seconds precompiling for 6 choices 2025-12-04T09:14:57.4756922Z Autotune Choices Stats: 2025-12-04T09:14:57.4758403Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_160", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.4758887Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4759215Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4759763Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4760922Z triton_flex_attention_backward_160 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4762271Z triton_flex_attention_backward_158 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4763522Z triton_flex_attention_backward_159 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4764660Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4765791Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4767005Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4768195Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4769337Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4770466Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4771635Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.4771883Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.1546 seconds precompiling for 13 choices 2025-12-04T09:14:57.4772019Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.4772087Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.4772150Z unimplemented [] 2025-12-04T09:14:57.4772260Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.4772492Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.4773684Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.4773746Z graph_break [] 2025-12-04T09:14:57.4773882Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.4773950Z Autotune Choices Stats: 2025-12-04T09:14:57.4775459Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.4775744Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4775958Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4776270Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4777384Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4778487Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4779638Z triton_flex_attention_171 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4780743Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.4781842Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.4782937Z triton_flex_attention_172 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4783185Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3199 seconds precompiling for 6 choices 2025-12-04T09:14:57.4783250Z Autotune Choices Stats: 2025-12-04T09:14:57.4784720Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_178", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.4785193Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4785518Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4786065Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4787205Z triton_flex_attention_backward_178 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4788381Z triton_flex_attention_backward_177 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4789517Z triton_flex_attention_backward_180 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4790644Z triton_flex_attention_backward_179 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4791773Z triton_flex_attention_backward_182 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4792993Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4794124Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4795263Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4796441Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4797578Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.4797819Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3449 seconds precompiling for 13 choices 2025-12-04T09:14:57.4797955Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.4798025Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.4798088Z unimplemented [] 2025-12-04T09:14:57.4798198Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.4798379Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.4799563Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.4799627Z graph_break [] 2025-12-04T09:14:57.4799754Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.4799827Z Autotune Choices Stats: 2025-12-04T09:14:57.4801271Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:57.4801556Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4801770Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4802075Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4803190Z triton_flex_attention_194 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4804326Z triton_flex_attention_195 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4805421Z triton_flex_attention_192 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.4806520Z triton_flex_attention_193 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.4807612Z triton_flex_attention_190 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4808763Z triton_flex_attention_191 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4809065Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3094 seconds precompiling for 6 choices 2025-12-04T09:14:57.4809132Z Autotune Choices Stats: 2025-12-04T09:14:57.4810542Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.4810983Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4811304Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4811850Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4813087Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4814228Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4815363Z triton_flex_attention_backward_197 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4816488Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4817713Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4818871Z triton_flex_attention_backward_200 0.0124 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4820007Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4821141Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4822303Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.4823428Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4823680Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.3930 seconds precompiling for 13 choices 2025-12-04T09:14:57.4823819Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.4823891Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.4823953Z unimplemented [] 2025-12-04T09:14:57.4824061Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.4824244Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.4825483Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.4825574Z graph_break [] 2025-12-04T09:14:57.4825703Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.4825778Z Autotune Choices Stats: 2025-12-04T09:14:57.4827146Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008063999935984612, "best_triton_pos": 0} 2025-12-04T09:14:57.4827410Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4827628Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4827937Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4829042Z triton_flex_attention_213 0.0081 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4830174Z triton_flex_attention_214 0.0082 ms 98.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4831267Z triton_flex_attention_211 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.4832365Z triton_flex_attention_212 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.4833451Z triton_flex_attention_209 0.0112 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4834643Z triton_flex_attention_210 0.0133 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4834888Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.3148 seconds precompiling for 6 choices 2025-12-04T09:14:57.4834954Z Autotune Choices Stats: 2025-12-04T09:14:57.4836356Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.4836796Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4837166Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4837716Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4838858Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4839994Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4841127Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4842337Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4843499Z triton_flex_attention_backward_219 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4844639Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4845771Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4846939Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4848065Z triton_flex_attention_backward_224 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4849199Z triton_flex_attention_backward_227 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.4849441Z SingleProcess AUTOTUNE benchmarking takes 0.6619 seconds and 2.3746 seconds precompiling for 13 choices 2025-12-04T09:14:57.4849620Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:14:57.4849698Z Traceback (most recent call last): 2025-12-04T09:14:57.4850000Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:14:57.4850076Z self.assertTrue( 2025-12-04T09:14:57.4850274Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:14:57.4850455Z raise self.failureException(msg) 2025-12-04T09:14:57.4850698Z AssertionError: False is not true : Log file /tmp/tmpm745bxoj/flex_attention_configs.json was not created 2025-12-04T09:14:57.4850704Z 2025-12-04T09:14:57.4850837Z To execute this test, run the following from the base repo dir: 2025-12-04T09:14:57.4851096Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:14:57.4851100Z 2025-12-04T09:14:57.4851263Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:14:57.4851408Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.4851488Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.4851552Z unimplemented [] 2025-12-04T09:14:57.4851661Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.4852893Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('async_compile_cache_miss', 23), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:14:57.4853088Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.4853190Z graph_break [] 2025-12-04T09:14:57.4853317Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.4854305Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:14:57.4854386Z current_size = base.storage().size() 2025-12-04T09:14:57.4854459Z Autotune Choices Stats: 2025-12-04T09:14:57.4855831Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:57.4856080Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4856300Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4856608Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4857728Z triton_flex_attention_4 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4858879Z triton_flex_attention_5 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4860001Z triton_flex_attention_0 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4861096Z triton_flex_attention_2 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.4862357Z triton_flex_attention_3 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.4863556Z triton_flex_attention_1 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4863804Z SingleProcess AUTOTUNE benchmarking takes 0.2600 seconds and 1.4244 seconds precompiling for 6 choices 2025-12-04T09:14:57.4863875Z Autotune Choices Stats: 2025-12-04T09:14:57.4865282Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:57.4865731Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4866056Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4866607Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4867878Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4869016Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4870152Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4871321Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4872446Z triton_flex_attention_backward_11 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4873571Z triton_flex_attention_backward_12 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4874699Z triton_flex_attention_backward_13 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4875911Z triton_flex_attention_backward_10 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4877073Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4878193Z triton_flex_attention_backward_15 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4878443Z SingleProcess AUTOTUNE benchmarking takes 0.8717 seconds and 2.2315 seconds precompiling for 13 choices 2025-12-04T09:14:57.4878578Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.4878690Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.4878754Z unimplemented [] 2025-12-04T09:14:57.4878856Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.4879045Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.4880220Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.4880290Z graph_break [] 2025-12-04T09:14:57.4880417Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.4880483Z Autotune Choices Stats: 2025-12-04T09:14:57.4881858Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.4882105Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4882319Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4882626Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4883797Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4884916Z triton_flex_attention_24 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4886015Z triton_flex_attention_19 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4887111Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.4888231Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.4889323Z triton_flex_attention_20 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4889564Z SingleProcess AUTOTUNE benchmarking takes 0.2898 seconds and 1.2778 seconds precompiling for 6 choices 2025-12-04T09:14:57.4889636Z Autotune Choices Stats: 2025-12-04T09:14:57.4891033Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.4891472Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4891858Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4892490Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4893661Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4894821Z triton_flex_attention_backward_27 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4895966Z triton_flex_attention_backward_28 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4897142Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4898273Z triton_flex_attention_backward_30 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4899419Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4900560Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4901787Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4902931Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4904072Z triton_flex_attention_backward_37 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.4904380Z SingleProcess AUTOTUNE benchmarking takes 0.6605 seconds and 2.3199 seconds precompiling for 13 choices 2025-12-04T09:14:57.4904511Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.4904589Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.4904650Z unimplemented [] 2025-12-04T09:14:57.4904755Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.4904943Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.4906464Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.4906545Z graph_break [] 2025-12-04T09:14:57.4906682Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.4906751Z Autotune Choices Stats: 2025-12-04T09:14:57.4908147Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.4908396Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4908621Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4908937Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4910225Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4911326Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4912432Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.4913573Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.4914664Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4915761Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4916005Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2616 seconds precompiling for 6 choices 2025-12-04T09:14:57.4916079Z Autotune Choices Stats: 2025-12-04T09:14:57.4917494Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.4917994Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4918361Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4918929Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4920079Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4921221Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4922399Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4923553Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4924689Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4925825Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4927036Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4928199Z triton_flex_attention_backward_50 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4929336Z triton_flex_attention_backward_53 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4930477Z triton_flex_attention_backward_56 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.4930756Z SingleProcess AUTOTUNE benchmarking takes 0.6631 seconds and 2.2635 seconds precompiling for 13 choices 2025-12-04T09:14:57.4930892Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.4930961Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.4931024Z unimplemented [] 2025-12-04T09:14:57.4931132Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.4931315Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.4932550Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.4932624Z graph_break [] 2025-12-04T09:14:57.4932757Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.4932837Z Autotune Choices Stats: 2025-12-04T09:14:57.4934222Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.4934473Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4934809Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4935122Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4936246Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4937347Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4938447Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.4939595Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.4940690Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4941791Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4942034Z SingleProcess AUTOTUNE benchmarking takes 0.2907 seconds and 1.2500 seconds precompiling for 6 choices 2025-12-04T09:14:57.4942113Z Autotune Choices Stats: 2025-12-04T09:14:57.4943589Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.4944071Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4944396Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4944953Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4946104Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4947286Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4948425Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4949570Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4950725Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4951921Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4953087Z triton_flex_attention_backward_69 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4954225Z triton_flex_attention_backward_70 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4955358Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4956527Z triton_flex_attention_backward_75 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.4956776Z SingleProcess AUTOTUNE benchmarking takes 0.6601 seconds and 2.1469 seconds precompiling for 13 choices 2025-12-04T09:14:57.4956904Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.4956979Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.4957040Z unimplemented [] 2025-12-04T09:14:57.4957144Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.4957324Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.4958491Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.4958557Z graph_break [] 2025-12-04T09:14:57.4958680Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.4958750Z Autotune Choices Stats: 2025-12-04T09:14:57.4960196Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.4960475Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4960689Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4961000Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4962301Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4963427Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4964742Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.4965862Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.4966964Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4968070Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4968317Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2576 seconds precompiling for 6 choices 2025-12-04T09:14:57.4968389Z Autotune Choices Stats: 2025-12-04T09:14:57.4969982Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_84", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.4970433Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4970762Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4971319Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4972522Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4973723Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4974859Z triton_flex_attention_backward_83 0.0122 ms 92.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4976003Z triton_flex_attention_backward_82 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4977439Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4978606Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4979747Z triton_flex_attention_backward_88 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.4980876Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.4982051Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.4983193Z triton_flex_attention_backward_91 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4983436Z SingleProcess AUTOTUNE benchmarking takes 0.6599 seconds and 2.2799 seconds precompiling for 13 choices 2025-12-04T09:14:57.4983568Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.4983646Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.4983711Z unimplemented [] 2025-12-04T09:14:57.4983822Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.4984017Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.4985189Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.4985258Z graph_break [] 2025-12-04T09:14:57.4985386Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.4985458Z Autotune Choices Stats: 2025-12-04T09:14:57.4986936Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:57.4987187Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4987402Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4987718Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4988838Z triton_flex_attention_99 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4989983Z triton_flex_attention_100 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4991081Z triton_flex_attention_95 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4992187Z triton_flex_attention_98 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.4993287Z triton_flex_attention_97 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.4994469Z triton_flex_attention_96 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4994750Z SingleProcess AUTOTUNE benchmarking takes 0.2905 seconds and 1.2994 seconds precompiling for 6 choices 2025-12-04T09:14:57.4994825Z Autotune Choices Stats: 2025-12-04T09:14:57.4996245Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_102", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.4996696Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.4997021Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.4997615Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.4998779Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.4999928Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5001070Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5002216Z triton_flex_attention_backward_101 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5003437Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5004603Z triton_flex_attention_backward_107 0.0133 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5005749Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5006881Z triton_flex_attention_backward_108 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5008058Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5009208Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5009454Z SingleProcess AUTOTUNE benchmarking takes 0.6615 seconds and 2.2434 seconds precompiling for 13 choices 2025-12-04T09:14:57.5009580Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.5009654Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.5009715Z unimplemented [] 2025-12-04T09:14:57.5009813Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.5010002Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.5011233Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.5011331Z graph_break [] 2025-12-04T09:14:57.5011455Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.5011526Z Autotune Choices Stats: 2025-12-04T09:14:57.5012963Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.5013217Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5013429Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5013736Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5014851Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5015992Z triton_flex_attention_119 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5017103Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.5018214Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.5019318Z triton_flex_attention_114 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5020513Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5020759Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2587 seconds precompiling for 6 choices 2025-12-04T09:14:57.5020831Z Autotune Choices Stats: 2025-12-04T09:14:57.5022248Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:57.5022684Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5023040Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5023595Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5024750Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5025898Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5027035Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5028240Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5029419Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5030556Z triton_flex_attention_backward_124 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5031698Z triton_flex_attention_backward_126 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5032884Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5034026Z triton_flex_attention_backward_129 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5035170Z triton_flex_attention_backward_132 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.5035410Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.3330 seconds precompiling for 13 choices 2025-12-04T09:14:57.5035541Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.5035617Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.5035677Z unimplemented [] 2025-12-04T09:14:57.5035776Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.5035973Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.5037239Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.5037308Z graph_break [] 2025-12-04T09:14:57.5037432Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.5037504Z Autotune Choices Stats: 2025-12-04T09:14:57.5038921Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:57.5039174Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5039471Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5039785Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5040926Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5042037Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5043156Z triton_flex_attention_136 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.5044270Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.5045437Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5046580Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5046829Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3488 seconds precompiling for 6 choices 2025-12-04T09:14:57.5046907Z Autotune Choices Stats: 2025-12-04T09:14:57.5048332Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_140", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.5048802Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5049134Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5049701Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5050863Z triton_flex_attention_backward_140 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5052038Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5053337Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5054531Z triton_flex_attention_backward_142 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5055687Z triton_flex_attention_backward_144 0.0131 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5056832Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5058020Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5059168Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5060318Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5061461Z triton_flex_attention_backward_149 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5061713Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.2392 seconds precompiling for 13 choices 2025-12-04T09:14:57.5062158Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.5062243Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.5062309Z unimplemented [] 2025-12-04T09:14:57.5062417Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.5062610Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.5063794Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.5063877Z graph_break [] 2025-12-04T09:14:57.5064016Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.5064084Z Autotune Choices Stats: 2025-12-04T09:14:57.5065487Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.5065789Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5066010Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5066327Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5067466Z triton_flex_attention_156 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5068586Z triton_flex_attention_157 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5069704Z triton_flex_attention_152 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5070885Z triton_flex_attention_154 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.5072042Z triton_flex_attention_155 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.5073169Z triton_flex_attention_153 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5073417Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3309 seconds precompiling for 6 choices 2025-12-04T09:14:57.5073490Z Autotune Choices Stats: 2025-12-04T09:14:57.5074922Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_160", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.5075397Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5075724Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5076288Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5077453Z triton_flex_attention_backward_160 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5078603Z triton_flex_attention_backward_158 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5079844Z triton_flex_attention_backward_159 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5080993Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5082139Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5083319Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5084468Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5085620Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5086770Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5087985Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.5088267Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.1546 seconds precompiling for 13 choices 2025-12-04T09:14:57.5088397Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.5088470Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.5088536Z unimplemented [] 2025-12-04T09:14:57.5088638Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.5088828Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.5090003Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.5090072Z graph_break [] 2025-12-04T09:14:57.5090199Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.5090264Z Autotune Choices Stats: 2025-12-04T09:14:57.5091664Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.5091944Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5092167Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5092519Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5093662Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5094771Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5095963Z triton_flex_attention_171 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5097106Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.5098217Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.5099331Z triton_flex_attention_172 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5099603Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3199 seconds precompiling for 6 choices 2025-12-04T09:14:57.5099674Z Autotune Choices Stats: 2025-12-04T09:14:57.5101094Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_178", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.5101529Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5101855Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5102421Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5103582Z triton_flex_attention_backward_178 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5104802Z triton_flex_attention_backward_177 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5105980Z triton_flex_attention_backward_180 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5107137Z triton_flex_attention_backward_179 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5108280Z triton_flex_attention_backward_182 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5109463Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5110619Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5111760Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5112971Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5114146Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.5114398Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3449 seconds precompiling for 13 choices 2025-12-04T09:14:57.5114529Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.5114607Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.5114668Z unimplemented [] 2025-12-04T09:14:57.5114771Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.5114974Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.5116143Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.5116247Z graph_break [] 2025-12-04T09:14:57.5116375Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.5116441Z Autotune Choices Stats: 2025-12-04T09:14:57.5117833Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:57.5118079Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5118300Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5118616Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5119751Z triton_flex_attention_194 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5120861Z triton_flex_attention_195 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5122074Z triton_flex_attention_192 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.5123197Z triton_flex_attention_193 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.5124303Z triton_flex_attention_190 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5125455Z triton_flex_attention_191 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5125702Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3094 seconds precompiling for 6 choices 2025-12-04T09:14:57.5125773Z Autotune Choices Stats: 2025-12-04T09:14:57.5127188Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.5127628Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5127953Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5128524Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5129764Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5130967Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5132119Z triton_flex_attention_backward_197 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5133324Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5134524Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5135681Z triton_flex_attention_backward_200 0.0124 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5136838Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5137981Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5139206Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.5140381Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5140633Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.3930 seconds precompiling for 13 choices 2025-12-04T09:14:57.5140763Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.5140836Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.5140901Z unimplemented [] 2025-12-04T09:14:57.5141008Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.5141205Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.5142412Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.5142481Z graph_break [] 2025-12-04T09:14:57.5142616Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.5142683Z Autotune Choices Stats: 2025-12-04T09:14:57.5144089Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008063999935984612, "best_triton_pos": 0} 2025-12-04T09:14:57.5144335Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5144564Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5144894Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5146025Z triton_flex_attention_213 0.0081 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5147201Z triton_flex_attention_214 0.0082 ms 98.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5148347Z triton_flex_attention_211 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.5149465Z triton_flex_attention_212 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.5150576Z triton_flex_attention_209 0.0112 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5151728Z triton_flex_attention_210 0.0133 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5151975Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.3148 seconds precompiling for 6 choices 2025-12-04T09:14:57.5152049Z Autotune Choices Stats: 2025-12-04T09:14:57.5153468Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.5153908Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5154230Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5154795Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5156057Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5157232Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5158385Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5159600Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5160747Z triton_flex_attention_backward_219 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5162048Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5163219Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5164475Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5165684Z triton_flex_attention_backward_224 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5166836Z triton_flex_attention_backward_227 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.5167094Z SingleProcess AUTOTUNE benchmarking takes 0.6619 seconds and 2.3746 seconds precompiling for 13 choices 2025-12-04T09:14:57.5167227Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.5167350Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.5167413Z unimplemented [] 2025-12-04T09:14:57.5167515Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.5167705Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.5168886Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.5168954Z graph_break [] 2025-12-04T09:14:57.5169083Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.5169165Z Autotune Choices Stats: 2025-12-04T09:14:57.5170557Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.5170800Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5171025Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5171341Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5172581Z triton_flex_attention_232 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5173768Z triton_flex_attention_233 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5174893Z triton_flex_attention_228 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5176007Z triton_flex_attention_230 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.5177182Z triton_flex_attention_231 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.5178308Z triton_flex_attention_229 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5178553Z SingleProcess AUTOTUNE benchmarking takes 0.2900 seconds and 1.2906 seconds precompiling for 6 choices 2025-12-04T09:14:57.5178623Z Autotune Choices Stats: 2025-12-04T09:14:57.5180057Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:57.5180500Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5180893Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5181481Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5182649Z triton_flex_attention_backward_234 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5183815Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5184966Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5186161Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5187312Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5188463Z triton_flex_attention_backward_240 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5189676Z triton_flex_attention_backward_241 0.0134 ms 91.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5190868Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5192023Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5193171Z triton_flex_attention_backward_246 0.0153 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.5193454Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2768 seconds precompiling for 13 choices 2025-12-04T09:14:57.5193628Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:14:57.5193713Z Traceback (most recent call last): 2025-12-04T09:14:57.5194013Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:14:57.5194079Z self.assertTrue( 2025-12-04T09:14:57.5194288Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:14:57.5194373Z raise self.failureException(msg) 2025-12-04T09:14:57.5194607Z AssertionError: False is not true : Log file /tmp/tmpz_msg098/flex_attention_configs.json was not created 2025-12-04T09:14:57.5194619Z 2025-12-04T09:14:57.5194755Z To execute this test, run the following from the base repo dir: 2025-12-04T09:14:57.5195009Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:14:57.5195012Z 2025-12-04T09:14:57.5195183Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:14:57.5195316Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.5195388Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.5195454Z unimplemented [] 2025-12-04T09:14:57.5195557Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.5196754Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('async_compile_cache_miss', 23), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:14:57.5196943Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.5197003Z graph_break [] 2025-12-04T09:14:57.5197136Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.5198224Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:14:57.5198314Z current_size = base.storage().size() 2025-12-04T09:14:57.5198380Z Autotune Choices Stats: 2025-12-04T09:14:57.5199776Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:57.5200022Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5200244Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5200560Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5201722Z triton_flex_attention_4 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5202843Z triton_flex_attention_5 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5203956Z triton_flex_attention_0 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5205061Z triton_flex_attention_2 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.5206234Z triton_flex_attention_3 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.5207373Z triton_flex_attention_1 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5207623Z SingleProcess AUTOTUNE benchmarking takes 0.2600 seconds and 1.4244 seconds precompiling for 6 choices 2025-12-04T09:14:57.5207688Z Autotune Choices Stats: 2025-12-04T09:14:57.5209118Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:57.5209587Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5209918Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5210476Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5211650Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5212856Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5214008Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5215225Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5216404Z triton_flex_attention_backward_11 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5217558Z triton_flex_attention_backward_12 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5218707Z triton_flex_attention_backward_13 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5219887Z triton_flex_attention_backward_10 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5221032Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5222179Z triton_flex_attention_backward_15 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5222430Z SingleProcess AUTOTUNE benchmarking takes 0.8717 seconds and 2.2315 seconds precompiling for 13 choices 2025-12-04T09:14:57.5222561Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.5222631Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.5222700Z unimplemented [] 2025-12-04T09:14:57.5222915Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.5223100Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.5224289Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.5224353Z graph_break [] 2025-12-04T09:14:57.5224487Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.5224556Z Autotune Choices Stats: 2025-12-04T09:14:57.5225944Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.5226224Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5226440Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5226761Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5227889Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5229015Z triton_flex_attention_24 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5230129Z triton_flex_attention_19 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5231238Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.5232459Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.5233627Z triton_flex_attention_20 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5233879Z SingleProcess AUTOTUNE benchmarking takes 0.2898 seconds and 1.2778 seconds precompiling for 6 choices 2025-12-04T09:14:57.5233947Z Autotune Choices Stats: 2025-12-04T09:14:57.5235371Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.5235855Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5236188Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5236747Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5237922Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5239078Z triton_flex_attention_backward_27 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5240295Z triton_flex_attention_backward_28 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5241474Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5242624Z triton_flex_attention_backward_30 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5243776Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5244970Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5246117Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5247266Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5248419Z triton_flex_attention_backward_37 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.5248761Z SingleProcess AUTOTUNE benchmarking takes 0.6605 seconds and 2.3199 seconds precompiling for 13 choices 2025-12-04T09:14:57.5248899Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.5248968Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.5249036Z unimplemented [] 2025-12-04T09:14:57.5249136Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.5249319Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.5250497Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.5250559Z graph_break [] 2025-12-04T09:14:57.5250689Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.5250754Z Autotune Choices Stats: 2025-12-04T09:14:57.5252139Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.5252489Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5252723Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5253043Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5254166Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5255284Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5256397Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.5257592Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.5258733Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5259842Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5260093Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2616 seconds precompiling for 6 choices 2025-12-04T09:14:57.5260193Z Autotune Choices Stats: 2025-12-04T09:14:57.5261614Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.5262188Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5262517Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5263072Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5264237Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5265494Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5266711Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5267858Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5269013Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5270246Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5271387Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5272532Z triton_flex_attention_backward_50 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5273682Z triton_flex_attention_backward_53 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5274891Z triton_flex_attention_backward_56 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.5275166Z SingleProcess AUTOTUNE benchmarking takes 0.6631 seconds and 2.2635 seconds precompiling for 13 choices 2025-12-04T09:14:57.5275303Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.5275378Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.5275446Z unimplemented [] 2025-12-04T09:14:57.5275548Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.5275732Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.5276911Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.5276974Z graph_break [] 2025-12-04T09:14:57.5277106Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.5277209Z Autotune Choices Stats: 2025-12-04T09:14:57.5278598Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.5278847Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5279064Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5279387Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5280519Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5281629Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5282806Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.5283950Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.5285060Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5286168Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5286468Z SingleProcess AUTOTUNE benchmarking takes 0.2907 seconds and 1.2500 seconds precompiling for 6 choices 2025-12-04T09:14:57.5286534Z Autotune Choices Stats: 2025-12-04T09:14:57.5287962Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.5288392Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5288730Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5289289Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5290453Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5291668Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5292911Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5294070Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5295256Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5296404Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5297546Z triton_flex_attention_backward_69 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5298697Z triton_flex_attention_backward_70 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5299907Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5301082Z triton_flex_attention_backward_75 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.5301324Z SingleProcess AUTOTUNE benchmarking takes 0.6601 seconds and 2.1469 seconds precompiling for 13 choices 2025-12-04T09:14:57.5301460Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.5301532Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.5301605Z unimplemented [] 2025-12-04T09:14:57.5301708Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.5301891Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.5303069Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.5303162Z graph_break [] 2025-12-04T09:14:57.5303299Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.5303366Z Autotune Choices Stats: 2025-12-04T09:14:57.5304746Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.5304995Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5305211Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5305530Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5306656Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5307878Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5309017Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.5310131Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.5311244Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5312380Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5312631Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2576 seconds precompiling for 6 choices 2025-12-04T09:14:57.5312705Z Autotune Choices Stats: 2025-12-04T09:14:57.5314125Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_84", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.5314559Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5314887Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5315442Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5316671Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5317870Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5319034Z triton_flex_attention_backward_83 0.0122 ms 92.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5320185Z triton_flex_attention_backward_82 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5321374Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5322522Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5323667Z triton_flex_attention_backward_88 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5324816Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5326050Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5327200Z triton_flex_attention_backward_91 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5327445Z SingleProcess AUTOTUNE benchmarking takes 0.6599 seconds and 2.2799 seconds precompiling for 13 choices 2025-12-04T09:14:57.5327578Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.5327645Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.5327707Z unimplemented [] 2025-12-04T09:14:57.5327813Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.5327993Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.5329208Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.5329270Z graph_break [] 2025-12-04T09:14:57.5329401Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.5329479Z Autotune Choices Stats: 2025-12-04T09:14:57.5330853Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:57.5331104Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5331322Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5331639Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5332821Z triton_flex_attention_99 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5334045Z triton_flex_attention_100 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5335158Z triton_flex_attention_95 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5336278Z triton_flex_attention_98 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.5337399Z triton_flex_attention_97 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.5338548Z triton_flex_attention_96 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5338795Z SingleProcess AUTOTUNE benchmarking takes 0.2905 seconds and 1.2994 seconds precompiling for 6 choices 2025-12-04T09:14:57.5338861Z Autotune Choices Stats: 2025-12-04T09:14:57.5340289Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_102", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.5340718Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5341047Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5341663Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5342859Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5344009Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5345166Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5346347Z triton_flex_attention_backward_101 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5347497Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5348649Z triton_flex_attention_backward_107 0.0133 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5349798Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5351043Z triton_flex_attention_backward_108 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5352214Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5353371Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5353612Z SingleProcess AUTOTUNE benchmarking takes 0.6615 seconds and 2.2434 seconds precompiling for 13 choices 2025-12-04T09:14:57.5353780Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.5353849Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.5353910Z unimplemented [] 2025-12-04T09:14:57.5354018Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.5354201Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.5355389Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.5355449Z graph_break [] 2025-12-04T09:14:57.5355582Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.5355647Z Autotune Choices Stats: 2025-12-04T09:14:57.5357030Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.5357288Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5357507Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5357829Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5359042Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5360185Z triton_flex_attention_119 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5361302Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.5362598Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.5363777Z triton_flex_attention_114 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5364880Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5365128Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2587 seconds precompiling for 6 choices 2025-12-04T09:14:57.5365200Z Autotune Choices Stats: 2025-12-04T09:14:57.5366616Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:57.5367054Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5367525Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5368092Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5369268Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5370425Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5371622Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5372819Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5373991Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5375143Z triton_flex_attention_backward_124 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5376362Z triton_flex_attention_backward_126 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5377543Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5378699Z triton_flex_attention_backward_129 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5379853Z triton_flex_attention_backward_132 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.5380128Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.3330 seconds precompiling for 13 choices 2025-12-04T09:14:57.5380268Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.5380338Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.5380398Z unimplemented [] 2025-12-04T09:14:57.5380508Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.5380690Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.5381871Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.5381932Z graph_break [] 2025-12-04T09:14:57.5382059Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.5382137Z Autotune Choices Stats: 2025-12-04T09:14:57.5383515Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:57.5383764Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5383980Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5384414Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5385534Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5386651Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5387758Z triton_flex_attention_136 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.5388917Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.5390043Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5391167Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5391417Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3488 seconds precompiling for 6 choices 2025-12-04T09:14:57.5391483Z Autotune Choices Stats: 2025-12-04T09:14:57.5392963Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_140", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.5393433Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5393754Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5394318Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5395478Z triton_flex_attention_backward_140 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5396632Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5397813Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5398969Z triton_flex_attention_backward_142 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5400122Z triton_flex_attention_backward_144 0.0131 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5401330Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5402510Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5403673Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5404825Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5406016Z triton_flex_attention_backward_149 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5406259Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.2392 seconds precompiling for 13 choices 2025-12-04T09:14:57.5406391Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.5406458Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.5406522Z unimplemented [] 2025-12-04T09:14:57.5406630Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.5406813Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.5407994Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.5408058Z graph_break [] 2025-12-04T09:14:57.5408182Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.5408254Z Autotune Choices Stats: 2025-12-04T09:14:57.5409708Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.5409988Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5410206Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5410524Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5411644Z triton_flex_attention_156 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5412812Z triton_flex_attention_157 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5413972Z triton_flex_attention_152 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5415090Z triton_flex_attention_154 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.5416203Z triton_flex_attention_155 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.5417315Z triton_flex_attention_153 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5417567Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3309 seconds precompiling for 6 choices 2025-12-04T09:14:57.5421780Z Autotune Choices Stats: 2025-12-04T09:14:57.5423522Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_160", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.5424027Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5424371Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5424960Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5426145Z triton_flex_attention_backward_160 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5427348Z triton_flex_attention_backward_158 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5428512Z triton_flex_attention_backward_159 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5429660Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5430809Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5432020Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5433207Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5434358Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5435531Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5436679Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.5436941Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.1546 seconds precompiling for 13 choices 2025-12-04T09:14:57.5437089Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.5437174Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.5437240Z unimplemented [] 2025-12-04T09:14:57.5437361Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.5437559Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.5438758Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.5438821Z graph_break [] 2025-12-04T09:14:57.5438952Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.5439031Z Autotune Choices Stats: 2025-12-04T09:14:57.5440481Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.5440766Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5440993Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5441313Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5442437Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5443551Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5444697Z triton_flex_attention_171 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5445810Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.5446931Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.5448046Z triton_flex_attention_172 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5448445Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3199 seconds precompiling for 6 choices 2025-12-04T09:14:57.5448520Z Autotune Choices Stats: 2025-12-04T09:14:57.5449953Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_178", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.5450400Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5450723Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5451289Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5452588Z triton_flex_attention_backward_178 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5453745Z triton_flex_attention_backward_177 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5454900Z triton_flex_attention_backward_180 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5456054Z triton_flex_attention_backward_179 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5457277Z triton_flex_attention_backward_182 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5458454Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5459617Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5460779Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5462155Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5463347Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.5463605Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3449 seconds precompiling for 13 choices 2025-12-04T09:14:57.5463745Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.5463816Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.5463881Z unimplemented [] 2025-12-04T09:14:57.5463997Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.5464185Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.5465374Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.5465610Z graph_break [] 2025-12-04T09:14:57.5465742Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.5465817Z Autotune Choices Stats: 2025-12-04T09:14:57.5467219Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:57.5467471Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5467694Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5468022Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5469170Z triton_flex_attention_194 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5470341Z triton_flex_attention_195 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5471455Z triton_flex_attention_192 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.5472572Z triton_flex_attention_193 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.5473681Z triton_flex_attention_190 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5474858Z triton_flex_attention_191 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5475149Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3094 seconds precompiling for 6 choices 2025-12-04T09:14:57.5475226Z Autotune Choices Stats: 2025-12-04T09:14:57.5476650Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.5477091Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5477414Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5478013Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5479193Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5480353Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5481506Z triton_flex_attention_backward_197 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5482733Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5483914Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5485060Z triton_flex_attention_backward_200 0.0124 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5486215Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5487391Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5488556Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.5489711Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5489961Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.3930 seconds precompiling for 13 choices 2025-12-04T09:14:57.5490097Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.5490168Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.5490232Z unimplemented [] 2025-12-04T09:14:57.5490355Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.5490541Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.5491778Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.5491878Z graph_break [] 2025-12-04T09:14:57.5492006Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.5492083Z Autotune Choices Stats: 2025-12-04T09:14:57.5493515Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008063999935984612, "best_triton_pos": 0} 2025-12-04T09:14:57.5493769Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5493987Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5494344Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5495471Z triton_flex_attention_213 0.0081 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5496591Z triton_flex_attention_214 0.0082 ms 98.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5497700Z triton_flex_attention_211 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.5498815Z triton_flex_attention_212 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.5499982Z triton_flex_attention_209 0.0112 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5501136Z triton_flex_attention_210 0.0133 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5501381Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.3148 seconds precompiling for 6 choices 2025-12-04T09:14:57.5501453Z Autotune Choices Stats: 2025-12-04T09:14:57.5502873Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.5503343Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5503670Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5504239Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5505398Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5506550Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5507698Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5508942Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5510094Z triton_flex_attention_backward_219 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5511252Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5512438Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5513600Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5514753Z triton_flex_attention_backward_224 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5515908Z triton_flex_attention_backward_227 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.5516159Z SingleProcess AUTOTUNE benchmarking takes 0.6619 seconds and 2.3746 seconds precompiling for 13 choices 2025-12-04T09:14:57.5516299Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.5516425Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.5516548Z unimplemented [] 2025-12-04T09:14:57.5516667Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.5516853Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.5518037Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.5518104Z graph_break [] 2025-12-04T09:14:57.5518233Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.5518309Z Autotune Choices Stats: 2025-12-04T09:14:57.5519698Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.5519980Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5520197Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5520520Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5521653Z triton_flex_attention_232 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5522776Z triton_flex_attention_233 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5523885Z triton_flex_attention_228 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5525068Z triton_flex_attention_230 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.5526207Z triton_flex_attention_231 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.5527324Z triton_flex_attention_229 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5527569Z SingleProcess AUTOTUNE benchmarking takes 0.2900 seconds and 1.2906 seconds precompiling for 6 choices 2025-12-04T09:14:57.5527654Z Autotune Choices Stats: 2025-12-04T09:14:57.5529085Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:57.5529561Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5529888Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5530455Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5531622Z triton_flex_attention_backward_234 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5532835Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5534056Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5535244Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5536396Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5537542Z triton_flex_attention_backward_240 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5538741Z triton_flex_attention_backward_241 0.0134 ms 91.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5539888Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5541044Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5542259Z triton_flex_attention_backward_246 0.0153 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.5542539Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2768 seconds precompiling for 13 choices 2025-12-04T09:14:57.5542668Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.5542745Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.5542807Z unimplemented [] 2025-12-04T09:14:57.5542918Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.5543105Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.5544285Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.5544354Z graph_break [] 2025-12-04T09:14:57.5544482Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.5544553Z Autotune Choices Stats: 2025-12-04T09:14:57.5545944Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.5546260Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5546479Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5546794Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5547924Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5549053Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5550169Z triton_flex_attention_250 0.0111 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.5551376Z triton_flex_attention_247 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5552489Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.5553608Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5553882Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2952 seconds precompiling for 6 choices 2025-12-04T09:14:57.5553957Z Autotune Choices Stats: 2025-12-04T09:14:57.5555379Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_255", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.5555820Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5556148Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5556717Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5557883Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5559115Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5560295Z triton_flex_attention_backward_254 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5561454Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5562776Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5564021Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5565182Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5566338Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5567492Z triton_flex_attention_backward_262 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5568790Z triton_flex_attention_backward_265 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.5569041Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.3139 seconds precompiling for 13 choices 2025-12-04T09:14:57.5569215Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:14:57.5569299Z Traceback (most recent call last): 2025-12-04T09:14:57.5569601Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:14:57.5569677Z self.assertTrue( 2025-12-04T09:14:57.5569888Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:14:57.5569970Z raise self.failureException(msg) 2025-12-04T09:14:57.5570216Z AssertionError: False is not true : Log file /tmp/tmpede0d40r/flex_attention_configs.json was not created 2025-12-04T09:14:57.5570221Z 2025-12-04T09:14:57.5570356Z To execute this test, run the following from the base repo dir: 2025-12-04T09:14:57.5570608Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:14:57.5570654Z 2025-12-04T09:14:57.5570830Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:14:57.5570964Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.5571041Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.5571105Z unimplemented [] 2025-12-04T09:14:57.5571210Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.5572447Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('async_compile_cache_miss', 23), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:14:57.5572632Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.5572710Z graph_break [] 2025-12-04T09:14:57.5572840Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.5573831Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:14:57.5573915Z current_size = base.storage().size() 2025-12-04T09:14:57.5573984Z Autotune Choices Stats: 2025-12-04T09:14:57.5575387Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:57.5575636Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5575994Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5576312Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5577445Z triton_flex_attention_4 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5578556Z triton_flex_attention_5 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5579668Z triton_flex_attention_0 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5580848Z triton_flex_attention_2 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.5581954Z triton_flex_attention_3 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.5583078Z triton_flex_attention_1 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5583332Z SingleProcess AUTOTUNE benchmarking takes 0.2600 seconds and 1.4244 seconds precompiling for 6 choices 2025-12-04T09:14:57.5583400Z Autotune Choices Stats: 2025-12-04T09:14:57.5584887Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:57.5585353Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5585685Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5586243Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5587416Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5588605Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5589747Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5590901Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5592047Z triton_flex_attention_backward_11 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5593253Z triton_flex_attention_backward_12 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5594426Z triton_flex_attention_backward_13 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5595577Z triton_flex_attention_backward_10 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5596731Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5597909Z triton_flex_attention_backward_15 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5598163Z SingleProcess AUTOTUNE benchmarking takes 0.8717 seconds and 2.2315 seconds precompiling for 13 choices 2025-12-04T09:14:57.5598295Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.5598378Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.5598449Z unimplemented [] 2025-12-04T09:14:57.5598556Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.5598747Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.5599922Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.5599990Z graph_break [] 2025-12-04T09:14:57.5600119Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.5600188Z Autotune Choices Stats: 2025-12-04T09:14:57.5601638Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.5601912Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5602137Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5602453Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5603594Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5604702Z triton_flex_attention_24 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5605852Z triton_flex_attention_19 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5606962Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.5608080Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.5609201Z triton_flex_attention_20 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5609444Z SingleProcess AUTOTUNE benchmarking takes 0.2898 seconds and 1.2778 seconds precompiling for 6 choices 2025-12-04T09:14:57.5609512Z Autotune Choices Stats: 2025-12-04T09:14:57.5611052Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.5611485Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5611818Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5612432Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5613612Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5614806Z triton_flex_attention_backward_27 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5615959Z triton_flex_attention_backward_28 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5617117Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5618324Z triton_flex_attention_backward_30 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5619512Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5620660Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5621812Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5623003Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5624145Z triton_flex_attention_backward_37 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.5624394Z SingleProcess AUTOTUNE benchmarking takes 0.6605 seconds and 2.3199 seconds precompiling for 13 choices 2025-12-04T09:14:57.5624522Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.5624592Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.5624661Z unimplemented [] 2025-12-04T09:14:57.5624765Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.5624958Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.5626128Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.5626195Z graph_break [] 2025-12-04T09:14:57.5626326Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.5626394Z Autotune Choices Stats: 2025-12-04T09:14:57.5627886Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.5628138Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5628362Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5628687Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5629814Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5630954Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5632077Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.5633192Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.5634302Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5635477Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5635754Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2616 seconds precompiling for 6 choices 2025-12-04T09:14:57.5635820Z Autotune Choices Stats: 2025-12-04T09:14:57.5637250Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.5637686Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5638014Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5638574Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5639783Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5640932Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5642100Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5643249Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5644472Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5645651Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5646799Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5647947Z triton_flex_attention_backward_50 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5649139Z triton_flex_attention_backward_53 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5650287Z triton_flex_attention_backward_56 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.5650541Z SingleProcess AUTOTUNE benchmarking takes 0.6631 seconds and 2.2635 seconds precompiling for 13 choices 2025-12-04T09:14:57.5650670Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.5650741Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.5650809Z unimplemented [] 2025-12-04T09:14:57.5650911Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.5651103Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.5652422Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.5652531Z graph_break [] 2025-12-04T09:14:57.5652672Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.5652739Z Autotune Choices Stats: 2025-12-04T09:14:57.5654125Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.5654375Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5654595Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5654909Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5656038Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5657188Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5658303Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.5659424Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.5660531Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5661729Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5662143Z SingleProcess AUTOTUNE benchmarking takes 0.2907 seconds and 1.2500 seconds precompiling for 6 choices 2025-12-04T09:14:57.5662218Z Autotune Choices Stats: 2025-12-04T09:14:57.5663651Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.5664090Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5664484Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5665043Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5666221Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5667387Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5668544Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5669791Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5670983Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5672140Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5673286Z triton_flex_attention_backward_69 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5674487Z triton_flex_attention_backward_70 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5675633Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5676780Z triton_flex_attention_backward_75 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.5677033Z SingleProcess AUTOTUNE benchmarking takes 0.6601 seconds and 2.1469 seconds precompiling for 13 choices 2025-12-04T09:14:57.5677165Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.5677237Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.5677305Z unimplemented [] 2025-12-04T09:14:57.5677417Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.5677609Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.5678882Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.5678949Z graph_break [] 2025-12-04T09:14:57.5679085Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.5679151Z Autotune Choices Stats: 2025-12-04T09:14:57.5680537Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.5680782Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5681002Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5681352Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5682477Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5683583Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5684711Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.5685826Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.5687001Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5688138Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5688386Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2576 seconds precompiling for 6 choices 2025-12-04T09:14:57.5688463Z Autotune Choices Stats: 2025-12-04T09:14:57.5689885Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_84", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.5690347Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5690677Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5691235Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5692440Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5693607Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5694827Z triton_flex_attention_backward_83 0.0122 ms 92.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5696005Z triton_flex_attention_backward_82 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5697156Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5698311Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5699497Z triton_flex_attention_backward_88 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5700649Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5701800Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5702950Z triton_flex_attention_backward_91 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5703199Z SingleProcess AUTOTUNE benchmarking takes 0.6599 seconds and 2.2799 seconds precompiling for 13 choices 2025-12-04T09:14:57.5703442Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.5703526Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.5703596Z unimplemented [] 2025-12-04T09:14:57.5703702Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.5703885Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.5705069Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.5705131Z graph_break [] 2025-12-04T09:14:57.5705263Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.5705333Z Autotune Choices Stats: 2025-12-04T09:14:57.5706715Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:57.5706992Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5707214Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5707531Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5708668Z triton_flex_attention_99 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5709795Z triton_flex_attention_100 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5710906Z triton_flex_attention_95 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5712082Z triton_flex_attention_98 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.5713267Z triton_flex_attention_97 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.5714379Z triton_flex_attention_96 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5714629Z SingleProcess AUTOTUNE benchmarking takes 0.2905 seconds and 1.2994 seconds precompiling for 6 choices 2025-12-04T09:14:57.5714707Z Autotune Choices Stats: 2025-12-04T09:14:57.5716137Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_102", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.5716605Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5716935Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5717497Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5718664Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5719822Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5721040Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5722229Z triton_flex_attention_backward_101 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5723382Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5724568Z triton_flex_attention_backward_107 0.0133 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5725717Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5726882Z triton_flex_attention_backward_108 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5728041Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5729261Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5729547Z SingleProcess AUTOTUNE benchmarking takes 0.6615 seconds and 2.2434 seconds precompiling for 13 choices 2025-12-04T09:14:57.5729675Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.5729742Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.5729816Z unimplemented [] 2025-12-04T09:14:57.5729920Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.5730104Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.5731286Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.5731349Z graph_break [] 2025-12-04T09:14:57.5731485Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.5731552Z Autotune Choices Stats: 2025-12-04T09:14:57.5733012Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.5733294Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5733514Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5733827Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5734956Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5736069Z triton_flex_attention_119 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5737267Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.5738401Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.5739522Z triton_flex_attention_114 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5740630Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5740919Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2587 seconds precompiling for 6 choices 2025-12-04T09:14:57.5740986Z Autotune Choices Stats: 2025-12-04T09:14:57.5742412Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:57.5742844Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5743171Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5743736Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5744906Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5746121Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5747325Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5748485Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5749642Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5750828Z triton_flex_attention_backward_124 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5751974Z triton_flex_attention_backward_126 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5753122Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5754266Z triton_flex_attention_backward_129 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5755509Z triton_flex_attention_backward_132 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.5755757Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.3330 seconds precompiling for 13 choices 2025-12-04T09:14:57.5755882Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.5755949Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.5756015Z unimplemented [] 2025-12-04T09:14:57.5756115Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.5756298Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.5757475Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.5757567Z graph_break [] 2025-12-04T09:14:57.5757697Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.5757763Z Autotune Choices Stats: 2025-12-04T09:14:57.5759147Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:57.5759384Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5759604Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5759915Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5761039Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5762331Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5763607Z triton_flex_attention_136 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.5764729Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.5765849Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5766959Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5767268Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3488 seconds precompiling for 6 choices 2025-12-04T09:14:57.5767334Z Autotune Choices Stats: 2025-12-04T09:14:57.5768752Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_140", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.5769181Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5769507Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5770061Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5771308Z triton_flex_attention_backward_140 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5772528Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5773702Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5774853Z triton_flex_attention_backward_142 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5776040Z triton_flex_attention_backward_144 0.0131 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5777193Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5778339Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5779502Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5780712Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5781900Z triton_flex_attention_backward_149 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5782154Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.2392 seconds precompiling for 13 choices 2025-12-04T09:14:57.5782283Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.5782352Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.5782420Z unimplemented [] 2025-12-04T09:14:57.5782521Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.5782701Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.5783884Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.5783983Z graph_break [] 2025-12-04T09:14:57.5784118Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.5784184Z Autotune Choices Stats: 2025-12-04T09:14:57.5785576Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.5785820Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5786041Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5786361Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5787493Z triton_flex_attention_156 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5788682Z triton_flex_attention_157 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5789841Z triton_flex_attention_152 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5790956Z triton_flex_attention_154 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.5792078Z triton_flex_attention_155 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.5793219Z triton_flex_attention_153 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5793472Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3309 seconds precompiling for 6 choices 2025-12-04T09:14:57.5793542Z Autotune Choices Stats: 2025-12-04T09:14:57.5794968Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_160", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.5795403Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5795733Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5796293Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5797558Z triton_flex_attention_backward_160 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5798714Z triton_flex_attention_backward_158 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5799872Z triton_flex_attention_backward_159 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5801067Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5802221Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5803385Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5804530Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5805756Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5806938Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5808091Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.5808341Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.1546 seconds precompiling for 13 choices 2025-12-04T09:14:57.5808474Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.5808574Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.5808640Z unimplemented [] 2025-12-04T09:14:57.5808741Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.5808924Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.5810103Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.5810165Z graph_break [] 2025-12-04T09:14:57.5810298Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.5810364Z Autotune Choices Stats: 2025-12-04T09:14:57.5811745Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.5811985Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5812200Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5812566Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5813768Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5814914Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5816036Z triton_flex_attention_171 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5817145Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.5818318Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.5819437Z triton_flex_attention_172 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5819688Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3199 seconds precompiling for 6 choices 2025-12-04T09:14:57.5819751Z Autotune Choices Stats: 2025-12-04T09:14:57.5821175Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_178", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.5821609Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5822001Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5822591Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5823759Z triton_flex_attention_backward_178 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5824912Z triton_flex_attention_backward_177 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5826065Z triton_flex_attention_backward_180 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5827244Z triton_flex_attention_backward_179 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5828409Z triton_flex_attention_backward_182 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5829576Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5830790Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5831987Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5833136Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5834289Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.5834560Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3449 seconds precompiling for 13 choices 2025-12-04T09:14:57.5834694Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.5834764Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.5834835Z unimplemented [] 2025-12-04T09:14:57.5834939Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.5835122Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.5836300Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.5836361Z graph_break [] 2025-12-04T09:14:57.5836497Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.5836562Z Autotune Choices Stats: 2025-12-04T09:14:57.5837940Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:57.5838185Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5838400Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5838784Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5839952Z triton_flex_attention_194 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5841072Z triton_flex_attention_195 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5842187Z triton_flex_attention_192 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.5843332Z triton_flex_attention_193 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.5844438Z triton_flex_attention_190 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5845549Z triton_flex_attention_191 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5845800Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3094 seconds precompiling for 6 choices 2025-12-04T09:14:57.5845866Z Autotune Choices Stats: 2025-12-04T09:14:57.5847289Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.5847857Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5848186Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5848748Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5849920Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5851072Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5852253Z triton_flex_attention_backward_197 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5853458Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5854611Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5855760Z triton_flex_attention_backward_200 0.0124 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5857014Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5858164Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5859319Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.5860505Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5860749Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.3930 seconds precompiling for 13 choices 2025-12-04T09:14:57.5860883Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.5860951Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.5861021Z unimplemented [] 2025-12-04T09:14:57.5861122Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.5861308Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.5862648Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.5862714Z graph_break [] 2025-12-04T09:14:57.5862847Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.5862914Z Autotune Choices Stats: 2025-12-04T09:14:57.5864297Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008063999935984612, "best_triton_pos": 0} 2025-12-04T09:14:57.5864831Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5865056Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5865390Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5866533Z triton_flex_attention_213 0.0081 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5867656Z triton_flex_attention_214 0.0082 ms 98.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5868768Z triton_flex_attention_211 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.5869949Z triton_flex_attention_212 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.5871062Z triton_flex_attention_209 0.0112 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5872175Z triton_flex_attention_210 0.0133 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5872425Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.3148 seconds precompiling for 6 choices 2025-12-04T09:14:57.5872490Z Autotune Choices Stats: 2025-12-04T09:14:57.5873980Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.5874441Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5874770Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5875332Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5876504Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5877684Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5878855Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5880007Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5881170Z triton_flex_attention_backward_219 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5882386Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5883564Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5884716Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5885860Z triton_flex_attention_backward_224 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5887041Z triton_flex_attention_backward_227 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.5887287Z SingleProcess AUTOTUNE benchmarking takes 0.6619 seconds and 2.3746 seconds precompiling for 13 choices 2025-12-04T09:14:57.5887425Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.5887495Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.5887563Z unimplemented [] 2025-12-04T09:14:57.5887667Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.5887852Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.5889030Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.5889093Z graph_break [] 2025-12-04T09:14:57.5889235Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.5889306Z Autotune Choices Stats: 2025-12-04T09:14:57.5890754Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.5891033Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5891251Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5891572Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5892778Z triton_flex_attention_232 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5893902Z triton_flex_attention_233 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5895050Z triton_flex_attention_228 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5896163Z triton_flex_attention_230 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.5897282Z triton_flex_attention_231 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.5898394Z triton_flex_attention_229 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5898643Z SingleProcess AUTOTUNE benchmarking takes 0.2900 seconds and 1.2906 seconds precompiling for 6 choices 2025-12-04T09:14:57.5898824Z Autotune Choices Stats: 2025-12-04T09:14:57.5900254Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:57.5900686Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5901023Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5901578Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5902783Z triton_flex_attention_backward_234 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5903937Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5905093Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5906245Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5907494Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5908684Z triton_flex_attention_backward_240 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5909833Z triton_flex_attention_backward_241 0.0134 ms 91.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5910983Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5912165Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5913319Z triton_flex_attention_backward_246 0.0153 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.5913563Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2768 seconds precompiling for 13 choices 2025-12-04T09:14:57.5913702Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.5913770Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.5913835Z unimplemented [] 2025-12-04T09:14:57.5913945Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.5914129Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.5915311Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.5915374Z graph_break [] 2025-12-04T09:14:57.5915610Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.5915679Z Autotune Choices Stats: 2025-12-04T09:14:57.5917065Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.5917319Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5917538Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5917860Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5918988Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5920143Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5921256Z triton_flex_attention_250 0.0111 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.5922373Z triton_flex_attention_247 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5923497Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.5924673Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5924960Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2952 seconds precompiling for 6 choices 2025-12-04T09:14:57.5925028Z Autotune Choices Stats: 2025-12-04T09:14:57.5926461Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_255", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.5926894Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5927227Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5927832Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5929002Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5930153Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5931327Z triton_flex_attention_backward_254 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5932511Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5933773Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5934926Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5936078Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5937264Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5938423Z triton_flex_attention_backward_262 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5939581Z triton_flex_attention_backward_265 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.5939828Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.3139 seconds precompiling for 13 choices 2025-12-04T09:14:57.5939959Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.5940026Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.5940089Z unimplemented [] 2025-12-04T09:14:57.5940196Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.5940376Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.5941619Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.5941712Z graph_break [] 2025-12-04T09:14:57.5941844Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.5941910Z Autotune Choices Stats: 2025-12-04T09:14:57.5943295Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.5943543Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5943756Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5944074Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5945241Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5946355Z triton_flex_attention_271 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5947470Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5948589Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.5949767Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.5950909Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5951154Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T09:14:57.5951219Z Autotune Choices Stats: 2025-12-04T09:14:57.5952663Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:57.5953132Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5953454Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5954022Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5955185Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5956343Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5957503Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5958720Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5959926Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5961075Z triton_flex_attention_backward_278 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5962384Z triton_flex_attention_backward_279 0.0136 ms 90.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5963602Z triton_flex_attention_backward_276 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5964752Z triton_flex_attention_backward_281 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5965908Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.5966152Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.2940 seconds precompiling for 13 choices 2025-12-04T09:14:57.5966330Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:14:57.5966408Z Traceback (most recent call last): 2025-12-04T09:14:57.5966706Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:14:57.5966915Z self.assertTrue( 2025-12-04T09:14:57.5967118Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:14:57.5967206Z raise self.failureException(msg) 2025-12-04T09:14:57.5967445Z AssertionError: False is not true : Log file /tmp/tmpqpok_9h7/flex_attention_configs.json was not created 2025-12-04T09:14:57.5967449Z 2025-12-04T09:14:57.5967585Z To execute this test, run the following from the base repo dir: 2025-12-04T09:14:57.5967845Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:14:57.5967849Z 2025-12-04T09:14:57.5968013Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:14:57.5968155Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.5968242Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.5968308Z unimplemented [] 2025-12-04T09:14:57.5968424Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.5969607Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('async_compile_cache_miss', 23), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:14:57.5969845Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.5969910Z graph_break [] 2025-12-04T09:14:57.5970040Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.5971049Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:14:57.5971135Z current_size = base.storage().size() 2025-12-04T09:14:57.5971213Z Autotune Choices Stats: 2025-12-04T09:14:57.5972641Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:57.5972915Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5973133Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5973455Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5974587Z triton_flex_attention_4 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5975808Z triton_flex_attention_5 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5976917Z triton_flex_attention_0 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5978031Z triton_flex_attention_2 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.5979169Z triton_flex_attention_3 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.5980286Z triton_flex_attention_1 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5980534Z SingleProcess AUTOTUNE benchmarking takes 0.2600 seconds and 1.4244 seconds precompiling for 6 choices 2025-12-04T09:14:57.5980607Z Autotune Choices Stats: 2025-12-04T09:14:57.5982030Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:57.5982471Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5982802Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5983460Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.5984653Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5985806Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5986956Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5988138Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5989299Z triton_flex_attention_backward_11 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5990442Z triton_flex_attention_backward_12 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.5991593Z triton_flex_attention_backward_13 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.5992816Z triton_flex_attention_backward_10 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5993999Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.5995152Z triton_flex_attention_backward_15 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.5995398Z SingleProcess AUTOTUNE benchmarking takes 0.8717 seconds and 2.2315 seconds precompiling for 13 choices 2025-12-04T09:14:57.5995565Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.5995641Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.5995703Z unimplemented [] 2025-12-04T09:14:57.5995804Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.5995993Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.5997163Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.5997232Z graph_break [] 2025-12-04T09:14:57.5997358Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.5997443Z Autotune Choices Stats: 2025-12-04T09:14:57.5998828Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.5999074Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.5999291Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.5999609Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6000810Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6001946Z triton_flex_attention_24 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6003058Z triton_flex_attention_19 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6004175Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.6005317Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.6006430Z triton_flex_attention_20 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6006673Z SingleProcess AUTOTUNE benchmarking takes 0.2898 seconds and 1.2778 seconds precompiling for 6 choices 2025-12-04T09:14:57.6006747Z Autotune Choices Stats: 2025-12-04T09:14:57.6008162Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.6008596Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6009016Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6009577Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6010740Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6011896Z triton_flex_attention_backward_27 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6013115Z triton_flex_attention_backward_28 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6014266Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6015417Z triton_flex_attention_backward_30 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6016559Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6017783Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6018963Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6020116Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6021268Z triton_flex_attention_backward_37 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.6021540Z SingleProcess AUTOTUNE benchmarking takes 0.6605 seconds and 2.3199 seconds precompiling for 13 choices 2025-12-04T09:14:57.6021672Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.6021746Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.6021808Z unimplemented [] 2025-12-04T09:14:57.6021909Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.6022097Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.6023270Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.6023339Z graph_break [] 2025-12-04T09:14:57.6023469Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.6023543Z Autotune Choices Stats: 2025-12-04T09:14:57.6024919Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.6025172Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6025388Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6025813Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6026945Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6028065Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6029190Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.6030341Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.6031449Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6032565Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6032807Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2616 seconds precompiling for 6 choices 2025-12-04T09:14:57.6032880Z Autotune Choices Stats: 2025-12-04T09:14:57.6034362Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.6034829Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6035153Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6035721Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6036883Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6038040Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6039227Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6040375Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6041527Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6042742Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6043932Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6045077Z triton_flex_attention_backward_50 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6046234Z triton_flex_attention_backward_53 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6047411Z triton_flex_attention_backward_56 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.6047664Z SingleProcess AUTOTUNE benchmarking takes 0.6631 seconds and 2.2635 seconds precompiling for 13 choices 2025-12-04T09:14:57.6047792Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.6047870Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.6047936Z unimplemented [] 2025-12-04T09:14:57.6048043Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.6048236Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.6049411Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.6049481Z graph_break [] 2025-12-04T09:14:57.6049615Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.6049683Z Autotune Choices Stats: 2025-12-04T09:14:57.6051137Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.6051423Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6051641Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6051956Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6053149Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6054263Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6055437Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.6056551Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.6057656Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6058774Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6059029Z SingleProcess AUTOTUNE benchmarking takes 0.2907 seconds and 1.2500 seconds precompiling for 6 choices 2025-12-04T09:14:57.6059103Z Autotune Choices Stats: 2025-12-04T09:14:57.6060592Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.6061062Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6061385Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6062115Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6063306Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6064533Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6065682Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6066835Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6067984Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6069225Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6070430Z triton_flex_attention_backward_69 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6071582Z triton_flex_attention_backward_70 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6072738Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6073917Z triton_flex_attention_backward_75 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.6074174Z SingleProcess AUTOTUNE benchmarking takes 0.6601 seconds and 2.1469 seconds precompiling for 13 choices 2025-12-04T09:14:57.6074308Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.6074386Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.6074448Z unimplemented [] 2025-12-04T09:14:57.6074552Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.6074750Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.6075921Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.6075988Z graph_break [] 2025-12-04T09:14:57.6076116Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.6076183Z Autotune Choices Stats: 2025-12-04T09:14:57.6077638Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.6077915Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6078136Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6078449Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6079583Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6080694Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6081842Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.6082958Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.6084061Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6085173Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6085526Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2576 seconds precompiling for 6 choices 2025-12-04T09:14:57.6085603Z Autotune Choices Stats: 2025-12-04T09:14:57.6087020Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_84", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.6087463Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6087787Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6088354Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6089670Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6090830Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6091976Z triton_flex_attention_backward_83 0.0122 ms 92.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6093182Z triton_flex_attention_backward_82 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6094400Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6095586Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6096736Z triton_flex_attention_backward_88 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6097879Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6099060Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6100203Z triton_flex_attention_backward_91 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6100455Z SingleProcess AUTOTUNE benchmarking takes 0.6599 seconds and 2.2799 seconds precompiling for 13 choices 2025-12-04T09:14:57.6100589Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.6100668Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.6100743Z unimplemented [] 2025-12-04T09:14:57.6100849Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.6101046Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.6102217Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.6102323Z graph_break [] 2025-12-04T09:14:57.6102508Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.6102580Z Autotune Choices Stats: 2025-12-04T09:14:57.6103965Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:57.6104207Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6104432Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6104746Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6105871Z triton_flex_attention_99 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6107017Z triton_flex_attention_100 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6108140Z triton_flex_attention_95 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6109268Z triton_flex_attention_98 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.6110393Z triton_flex_attention_97 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.6111572Z triton_flex_attention_96 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6111845Z SingleProcess AUTOTUNE benchmarking takes 0.2905 seconds and 1.2994 seconds precompiling for 6 choices 2025-12-04T09:14:57.6111916Z Autotune Choices Stats: 2025-12-04T09:14:57.6113335Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_102", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.6113770Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6114091Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6114708Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6115879Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6117033Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6118181Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6119419Z triton_flex_attention_backward_101 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6120608Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6121776Z triton_flex_attention_backward_107 0.0133 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6122928Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6124108Z triton_flex_attention_backward_108 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6125274Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6126420Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6126667Z SingleProcess AUTOTUNE benchmarking takes 0.6615 seconds and 2.2434 seconds precompiling for 13 choices 2025-12-04T09:14:57.6126794Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.6126877Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.6126937Z unimplemented [] 2025-12-04T09:14:57.6127037Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.6127225Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.6128455Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.6128556Z graph_break [] 2025-12-04T09:14:57.6128683Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.6128750Z Autotune Choices Stats: 2025-12-04T09:14:57.6130139Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.6130379Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6130598Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6130955Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6132092Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6133259Z triton_flex_attention_119 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6134387Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.6135503Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.6136685Z triton_flex_attention_114 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6137834Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6138078Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2587 seconds precompiling for 6 choices 2025-12-04T09:14:57.6138144Z Autotune Choices Stats: 2025-12-04T09:14:57.6139569Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:57.6140042Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6140369Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6140932Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6142108Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6143271Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6144418Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6145665Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6146820Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6147976Z triton_flex_attention_backward_124 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6149160Z triton_flex_attention_backward_126 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6150305Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6151462Z triton_flex_attention_backward_129 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6152605Z triton_flex_attention_backward_132 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.6152857Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.3330 seconds precompiling for 13 choices 2025-12-04T09:14:57.6152985Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.6153060Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.6153179Z unimplemented [] 2025-12-04T09:14:57.6153341Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.6153535Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.6154706Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.6154773Z graph_break [] 2025-12-04T09:14:57.6154898Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.6154964Z Autotune Choices Stats: 2025-12-04T09:14:57.6156360Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:57.6156637Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6156861Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6157175Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6158306Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6159420Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6160547Z triton_flex_attention_136 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.6161726Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.6163058Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6164179Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6164425Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3488 seconds precompiling for 6 choices 2025-12-04T09:14:57.6164491Z Autotune Choices Stats: 2025-12-04T09:14:57.6165923Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_140", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.6166432Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6166766Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6167328Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6168513Z triton_flex_attention_backward_140 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6169681Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6170920Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6172113Z triton_flex_attention_backward_142 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6173331Z triton_flex_attention_backward_144 0.0131 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6174506Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6175720Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6176876Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6178051Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6179272Z triton_flex_attention_backward_149 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6179553Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.2392 seconds precompiling for 13 choices 2025-12-04T09:14:57.6179684Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.6179759Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.6179820Z unimplemented [] 2025-12-04T09:14:57.6179921Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.6180107Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.6181297Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.6181363Z graph_break [] 2025-12-04T09:14:57.6181490Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.6181557Z Autotune Choices Stats: 2025-12-04T09:14:57.6182952Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.6183253Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6183475Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6183791Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6184920Z triton_flex_attention_156 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6186047Z triton_flex_attention_157 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6187164Z triton_flex_attention_152 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6188368Z triton_flex_attention_154 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.6189491Z triton_flex_attention_155 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.6190608Z triton_flex_attention_153 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6190883Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3309 seconds precompiling for 6 choices 2025-12-04T09:14:57.6190948Z Autotune Choices Stats: 2025-12-04T09:14:57.6192376Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_160", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.6192808Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6193139Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6193701Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6194875Z triton_flex_attention_backward_160 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6196094Z triton_flex_attention_backward_158 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6197272Z triton_flex_attention_backward_159 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6198423Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6199588Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6200777Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6201923Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6203076Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6204230Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6205448Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.6205728Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.1546 seconds precompiling for 13 choices 2025-12-04T09:14:57.6205858Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.6205932Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.6205995Z unimplemented [] 2025-12-04T09:14:57.6206094Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.6206294Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.6207475Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.6207543Z graph_break [] 2025-12-04T09:14:57.6207704Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.6207770Z Autotune Choices Stats: 2025-12-04T09:14:57.6209161Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.6209403Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6209626Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6209941Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6211072Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6212185Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6213416Z triton_flex_attention_171 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6214579Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.6215697Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.6216814Z triton_flex_attention_172 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6217090Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3199 seconds precompiling for 6 choices 2025-12-04T09:14:57.6217159Z Autotune Choices Stats: 2025-12-04T09:14:57.6218586Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_178", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.6219017Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6219351Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6219907Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6221077Z triton_flex_attention_backward_178 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6222320Z triton_flex_attention_backward_177 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6223483Z triton_flex_attention_backward_180 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6224637Z triton_flex_attention_backward_179 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6225817Z triton_flex_attention_backward_182 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6226975Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6228128Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6229285Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6230508Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6231686Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.6231936Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3449 seconds precompiling for 13 choices 2025-12-04T09:14:57.6232067Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.6232135Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.6232207Z unimplemented [] 2025-12-04T09:14:57.6232308Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.6232497Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.6233664Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.6233773Z graph_break [] 2025-12-04T09:14:57.6233901Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.6233971Z Autotune Choices Stats: 2025-12-04T09:14:57.6235359Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:57.6235602Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6235822Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6236140Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6237269Z triton_flex_attention_194 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6238456Z triton_flex_attention_195 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6239612Z triton_flex_attention_192 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.6240726Z triton_flex_attention_193 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.6241845Z triton_flex_attention_190 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6242997Z triton_flex_attention_191 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6243239Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3094 seconds precompiling for 6 choices 2025-12-04T09:14:57.6243304Z Autotune Choices Stats: 2025-12-04T09:14:57.6244736Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.6245164Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6245492Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6246051Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6247303Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6248530Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6249686Z triton_flex_attention_backward_197 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6250841Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6252030Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6253243Z triton_flex_attention_backward_200 0.0124 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6254396Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6255620Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6256798Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.6257951Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6258202Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.3930 seconds precompiling for 13 choices 2025-12-04T09:14:57.6258325Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.6258393Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.6258460Z unimplemented [] 2025-12-04T09:14:57.6258594Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.6258787Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.6259960Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.6260021Z graph_break [] 2025-12-04T09:14:57.6260153Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.6260217Z Autotune Choices Stats: 2025-12-04T09:14:57.6261608Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008063999935984612, "best_triton_pos": 0} 2025-12-04T09:14:57.6261997Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6262228Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6262553Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6263788Z triton_flex_attention_213 0.0081 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6264950Z triton_flex_attention_214 0.0082 ms 98.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6266089Z triton_flex_attention_211 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.6267208Z triton_flex_attention_212 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.6268378Z triton_flex_attention_209 0.0112 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6269493Z triton_flex_attention_210 0.0133 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6269745Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.3148 seconds precompiling for 6 choices 2025-12-04T09:14:57.6269811Z Autotune Choices Stats: 2025-12-04T09:14:57.6271242Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.6271675Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6272006Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6272656Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6273824Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6274983Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6276140Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6277329Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6278490Z triton_flex_attention_backward_219 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6279646Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6280794Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6282056Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6283209Z triton_flex_attention_backward_224 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6284363Z triton_flex_attention_backward_227 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.6284645Z SingleProcess AUTOTUNE benchmarking takes 0.6619 seconds and 2.3746 seconds precompiling for 13 choices 2025-12-04T09:14:57.6284772Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.6284840Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.6284909Z unimplemented [] 2025-12-04T09:14:57.6285010Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.6285204Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.6286382Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.6286455Z graph_break [] 2025-12-04T09:14:57.6286597Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.6286666Z Autotune Choices Stats: 2025-12-04T09:14:57.6288058Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.6288298Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6288523Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6288836Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6290062Z triton_flex_attention_232 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6291180Z triton_flex_attention_233 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6292367Z triton_flex_attention_228 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6293484Z triton_flex_attention_230 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.6294644Z triton_flex_attention_231 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.6295757Z triton_flex_attention_229 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6296009Z SingleProcess AUTOTUNE benchmarking takes 0.2900 seconds and 1.2906 seconds precompiling for 6 choices 2025-12-04T09:14:57.6296076Z Autotune Choices Stats: 2025-12-04T09:14:57.6297518Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:57.6298019Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6298383Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6298948Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6300115Z triton_flex_attention_backward_234 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6301264Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6302457Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6303615Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6304769Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6305921Z triton_flex_attention_backward_240 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6307235Z triton_flex_attention_backward_241 0.0134 ms 91.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6308424Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6309603Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6310754Z triton_flex_attention_backward_246 0.0153 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.6311054Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2768 seconds precompiling for 13 choices 2025-12-04T09:14:57.6311181Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.6311247Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.6311315Z unimplemented [] 2025-12-04T09:14:57.6311415Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.6311600Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.6312780Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.6312844Z graph_break [] 2025-12-04T09:14:57.6312975Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.6313043Z Autotune Choices Stats: 2025-12-04T09:14:57.6314431Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.6314670Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6314983Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6315298Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6316428Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6317544Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6318670Z triton_flex_attention_250 0.0111 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.6319810Z triton_flex_attention_247 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6320928Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.6322055Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6322303Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2952 seconds precompiling for 6 choices 2025-12-04T09:14:57.6322371Z Autotune Choices Stats: 2025-12-04T09:14:57.6323855Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_255", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.6324317Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6324646Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6325205Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6326378Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6327574Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6328730Z triton_flex_attention_backward_254 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6329889Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6331035Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6332252Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6333513Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6334673Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6335832Z triton_flex_attention_backward_262 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6337016Z triton_flex_attention_backward_265 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.6337265Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.3139 seconds precompiling for 13 choices 2025-12-04T09:14:57.6337392Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.6337462Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.6337528Z unimplemented [] 2025-12-04T09:14:57.6337630Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.6337811Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.6338997Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.6339057Z graph_break [] 2025-12-04T09:14:57.6339189Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.6339257Z Autotune Choices Stats: 2025-12-04T09:14:57.6340718Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.6341007Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6341227Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6341544Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6342666Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6343808Z triton_flex_attention_271 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6344968Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6346090Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.6347211Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.6348322Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6348581Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T09:14:57.6348649Z Autotune Choices Stats: 2025-12-04T09:14:57.6350173Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:57.6350609Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6350942Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6351514Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6352686Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6353876Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6355035Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6356196Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6357411Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6358599Z triton_flex_attention_backward_278 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6359750Z triton_flex_attention_backward_279 0.0136 ms 90.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6360904Z triton_flex_attention_backward_276 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6362426Z triton_flex_attention_backward_281 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6363625Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.6363885Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.2940 seconds precompiling for 13 choices 2025-12-04T09:14:57.6364017Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.6364092Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.6364166Z unimplemented [] 2025-12-04T09:14:57.6364271Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.6364458Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.6365638Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.6365700Z graph_break [] 2025-12-04T09:14:57.6365837Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.6365902Z Autotune Choices Stats: 2025-12-04T09:14:57.6367476Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.6367728Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6367955Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6368278Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6369419Z triton_flex_attention_289 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6370594Z triton_flex_attention_290 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6371719Z triton_flex_attention_287 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.6372891Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6374013Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.6375220Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6375503Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3003 seconds precompiling for 6 choices 2025-12-04T09:14:57.6375570Z Autotune Choices Stats: 2025-12-04T09:14:57.6377001Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_294", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.012160000391304493, "best_triton_pos": 0} 2025-12-04T09:14:57.6377443Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6377772Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6378361Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6379537Z triton_flex_attention_backward_294 0.0122 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6380687Z triton_flex_attention_backward_291 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6381853Z triton_flex_attention_backward_292 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6382996Z triton_flex_attention_backward_293 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6384232Z triton_flex_attention_backward_296 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6385416Z triton_flex_attention_backward_295 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6386568Z triton_flex_attention_backward_297 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6387717Z triton_flex_attention_backward_298 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6388903Z triton_flex_attention_backward_300 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6390060Z triton_flex_attention_backward_303 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.6390310Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.2523 seconds precompiling for 13 choices 2025-12-04T09:14:57.6390485Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:14:57.6390563Z Traceback (most recent call last): 2025-12-04T09:14:57.6390865Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:14:57.6390931Z self.assertTrue( 2025-12-04T09:14:57.6391127Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:14:57.6391215Z raise self.failureException(msg) 2025-12-04T09:14:57.6391451Z AssertionError: False is not true : Log file /tmp/tmp_l7fygn8/flex_attention_configs.json was not created 2025-12-04T09:14:57.6391456Z 2025-12-04T09:14:57.6391597Z To execute this test, run the following from the base repo dir: 2025-12-04T09:14:57.6391941Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:14:57.6391945Z 2025-12-04T09:14:57.6392108Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:14:57.6392260Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.6392331Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.6392399Z unimplemented [] 2025-12-04T09:14:57.6392502Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.6393689Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('async_compile_cache_miss', 23), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:14:57.6393882Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.6393942Z graph_break [] 2025-12-04T09:14:57.6394074Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.6395057Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:14:57.6395173Z current_size = base.storage().size() 2025-12-04T09:14:57.6395250Z Autotune Choices Stats: 2025-12-04T09:14:57.6396645Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:57.6396899Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6397117Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6397446Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6398587Z triton_flex_attention_4 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6399698Z triton_flex_attention_5 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6400910Z triton_flex_attention_0 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6406887Z triton_flex_attention_2 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.6408056Z triton_flex_attention_3 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.6409282Z triton_flex_attention_1 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6409548Z SingleProcess AUTOTUNE benchmarking takes 0.2600 seconds and 1.4244 seconds precompiling for 6 choices 2025-12-04T09:14:57.6409621Z Autotune Choices Stats: 2025-12-04T09:14:57.6411058Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:57.6411509Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6411840Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6412478Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6413751Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6414949Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6416121Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6417271Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6418458Z triton_flex_attention_backward_11 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6419608Z triton_flex_attention_backward_12 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6420768Z triton_flex_attention_backward_13 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6421914Z triton_flex_attention_backward_10 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6423117Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6424312Z triton_flex_attention_backward_15 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6424573Z SingleProcess AUTOTUNE benchmarking takes 0.8717 seconds and 2.2315 seconds precompiling for 13 choices 2025-12-04T09:14:57.6424730Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.6424804Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.6424869Z unimplemented [] 2025-12-04T09:14:57.6424986Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.6425173Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.6426414Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.6426481Z graph_break [] 2025-12-04T09:14:57.6426611Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.6426683Z Autotune Choices Stats: 2025-12-04T09:14:57.6428070Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.6428323Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6428545Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6428867Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6429988Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6431160Z triton_flex_attention_24 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6432305Z triton_flex_attention_19 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6433426Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.6434532Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.6435675Z triton_flex_attention_20 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6435924Z SingleProcess AUTOTUNE benchmarking takes 0.2898 seconds and 1.2778 seconds precompiling for 6 choices 2025-12-04T09:14:57.6435999Z Autotune Choices Stats: 2025-12-04T09:14:57.6437432Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.6437870Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6438193Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6438765Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6440017Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6441176Z triton_flex_attention_backward_27 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6442322Z triton_flex_attention_backward_28 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6443497Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6444645Z triton_flex_attention_backward_30 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6445788Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6446929Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6448135Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6449308Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6450459Z triton_flex_attention_backward_37 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.6450706Z SingleProcess AUTOTUNE benchmarking takes 0.6605 seconds and 2.3199 seconds precompiling for 13 choices 2025-12-04T09:14:57.6450844Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.6450947Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.6451011Z unimplemented [] 2025-12-04T09:14:57.6451122Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.6451306Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.6452542Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.6452606Z graph_break [] 2025-12-04T09:14:57.6452734Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.6452808Z Autotune Choices Stats: 2025-12-04T09:14:57.6454194Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.6454445Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6454663Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6454986Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6456201Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6457376Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6458492Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.6459610Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.6460751Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6462014Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6462271Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2616 seconds precompiling for 6 choices 2025-12-04T09:14:57.6462347Z Autotune Choices Stats: 2025-12-04T09:14:57.6463778Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.6464220Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6464655Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6465268Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6466447Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6467616Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6468786Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6469989Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6471137Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6472283Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6473526Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6474705Z triton_flex_attention_backward_50 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6475857Z triton_flex_attention_backward_53 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6477005Z triton_flex_attention_backward_56 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.6477287Z SingleProcess AUTOTUNE benchmarking takes 0.6631 seconds and 2.2635 seconds precompiling for 13 choices 2025-12-04T09:14:57.6477428Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.6477499Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.6477567Z unimplemented [] 2025-12-04T09:14:57.6477696Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.6477885Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.6479062Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.6479136Z graph_break [] 2025-12-04T09:14:57.6479267Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.6479340Z Autotune Choices Stats: 2025-12-04T09:14:57.6480730Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.6480979Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6481199Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6481591Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6482752Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6483867Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6484973Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.6486118Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.6487219Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6488332Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6488578Z SingleProcess AUTOTUNE benchmarking takes 0.2907 seconds and 1.2500 seconds precompiling for 6 choices 2025-12-04T09:14:57.6488648Z Autotune Choices Stats: 2025-12-04T09:14:57.6490067Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.6490620Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6490947Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6491511Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6492733Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6493895Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6495076Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6496229Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6497374Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6498514Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6499723Z triton_flex_attention_backward_69 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6500897Z triton_flex_attention_backward_70 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6502047Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6503226Z triton_flex_attention_backward_75 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.6503468Z SingleProcess AUTOTUNE benchmarking takes 0.6601 seconds and 2.1469 seconds precompiling for 13 choices 2025-12-04T09:14:57.6503596Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.6503668Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.6503729Z unimplemented [] 2025-12-04T09:14:57.6503834Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.6504015Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.6505188Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.6505257Z graph_break [] 2025-12-04T09:14:57.6505386Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.6505457Z Autotune Choices Stats: 2025-12-04T09:14:57.6506845Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.6507196Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6507413Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6507725Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6508854Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6509968Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6511075Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.6512224Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.6513327Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6514438Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6514690Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2576 seconds precompiling for 6 choices 2025-12-04T09:14:57.6514767Z Autotune Choices Stats: 2025-12-04T09:14:57.6516245Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_84", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.6516712Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6517037Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6517601Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6518755Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6519946Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6521089Z triton_flex_attention_backward_83 0.0122 ms 92.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6522241Z triton_flex_attention_backward_82 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6523390Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6524610Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6525787Z triton_flex_attention_backward_88 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6526926Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6528072Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6529256Z triton_flex_attention_backward_91 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6529500Z SingleProcess AUTOTUNE benchmarking takes 0.6599 seconds and 2.2799 seconds precompiling for 13 choices 2025-12-04T09:14:57.6529628Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.6529702Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.6529766Z unimplemented [] 2025-12-04T09:14:57.6529872Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.6530051Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.6531225Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.6531295Z graph_break [] 2025-12-04T09:14:57.6531422Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.6531493Z Autotune Choices Stats: 2025-12-04T09:14:57.6532995Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:57.6533283Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6533501Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6533817Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6534954Z triton_flex_attention_99 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6536072Z triton_flex_attention_100 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6537214Z triton_flex_attention_95 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6538331Z triton_flex_attention_98 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.6539443Z triton_flex_attention_97 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.6540558Z triton_flex_attention_96 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6540804Z SingleProcess AUTOTUNE benchmarking takes 0.2905 seconds and 1.2994 seconds precompiling for 6 choices 2025-12-04T09:14:57.6540910Z Autotune Choices Stats: 2025-12-04T09:14:57.6542390Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_102", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.6542830Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6543157Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6543724Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6544882Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6546075Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6547235Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6548385Z triton_flex_attention_backward_101 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6549606Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6550781Z triton_flex_attention_backward_107 0.0133 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6551932Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6553074Z triton_flex_attention_backward_108 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6554267Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6555416Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6555660Z SingleProcess AUTOTUNE benchmarking takes 0.6615 seconds and 2.2434 seconds precompiling for 13 choices 2025-12-04T09:14:57.6555788Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.6555865Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.6555925Z unimplemented [] 2025-12-04T09:14:57.6556030Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.6556211Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.6557391Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.6557459Z graph_break [] 2025-12-04T09:14:57.6557585Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.6557753Z Autotune Choices Stats: 2025-12-04T09:14:57.6559147Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.6559395Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6559609Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6559925Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6561053Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6562402Z triton_flex_attention_119 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6563540Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.6564658Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.6565764Z triton_flex_attention_114 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6566990Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6567284Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2587 seconds precompiling for 6 choices 2025-12-04T09:14:57.6567358Z Autotune Choices Stats: 2025-12-04T09:14:57.6568782Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:57.6569221Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6569545Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6570152Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6571315Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6572523Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6573679Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6574835Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6576084Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6577246Z triton_flex_attention_backward_124 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6578399Z triton_flex_attention_backward_126 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6579587Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6580737Z triton_flex_attention_backward_129 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6581899Z triton_flex_attention_backward_132 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.6582143Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.3330 seconds precompiling for 13 choices 2025-12-04T09:14:57.6582272Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.6582347Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.6582410Z unimplemented [] 2025-12-04T09:14:57.6582511Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.6582701Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.6583939Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.6584059Z graph_break [] 2025-12-04T09:14:57.6584185Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.6584257Z Autotune Choices Stats: 2025-12-04T09:14:57.6585644Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:57.6585895Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6586112Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6586431Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6587597Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6588706Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6589822Z triton_flex_attention_136 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.6590949Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.6592116Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6593266Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6593515Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3488 seconds precompiling for 6 choices 2025-12-04T09:14:57.6593587Z Autotune Choices Stats: 2025-12-04T09:14:57.6595012Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_140", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.6595454Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6595809Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6596381Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6597546Z triton_flex_attention_backward_140 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6598712Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6599859Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6601081Z triton_flex_attention_backward_142 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6602260Z triton_flex_attention_backward_144 0.0131 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6603405Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6604552Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6605728Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6606872Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6608017Z triton_flex_attention_backward_149 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6608270Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.2392 seconds precompiling for 13 choices 2025-12-04T09:14:57.6608403Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.6608486Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.6608547Z unimplemented [] 2025-12-04T09:14:57.6608651Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.6608967Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.6610138Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.6610205Z graph_break [] 2025-12-04T09:14:57.6610334Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.6610416Z Autotune Choices Stats: 2025-12-04T09:14:57.6611802Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.6612049Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6612347Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6612675Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6613808Z triton_flex_attention_156 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6614917Z triton_flex_attention_157 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6616036Z triton_flex_attention_152 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6617148Z triton_flex_attention_154 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.6618370Z triton_flex_attention_155 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.6619487Z triton_flex_attention_153 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6619732Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3309 seconds precompiling for 6 choices 2025-12-04T09:14:57.6619803Z Autotune Choices Stats: 2025-12-04T09:14:57.6621217Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_160", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.6621691Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6622019Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6622579Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6623740Z triton_flex_attention_backward_160 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6624896Z triton_flex_attention_backward_158 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6626106Z triton_flex_attention_backward_159 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6627293Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6628438Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6629591Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6630775Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6631917Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6633070Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6634211Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.6634522Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.1546 seconds precompiling for 13 choices 2025-12-04T09:14:57.6634682Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.6634757Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.6634818Z unimplemented [] 2025-12-04T09:14:57.6634924Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.6635113Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.6636276Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.6636346Z graph_break [] 2025-12-04T09:14:57.6636473Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.6636542Z Autotune Choices Stats: 2025-12-04T09:14:57.6637949Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.6638233Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6638454Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6638772Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6639906Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6641018Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6642133Z triton_flex_attention_171 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6643310Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.6644449Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.6645567Z triton_flex_attention_172 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6645812Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3199 seconds precompiling for 6 choices 2025-12-04T09:14:57.6645919Z Autotune Choices Stats: 2025-12-04T09:14:57.6647356Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_178", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.6647806Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6648134Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6648704Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6649872Z triton_flex_attention_backward_178 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6651027Z triton_flex_attention_backward_177 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6652286Z triton_flex_attention_backward_180 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6653512Z triton_flex_attention_backward_179 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6654659Z triton_flex_attention_backward_182 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6655847Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6657002Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6658149Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6659310Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6660523Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.6660812Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3449 seconds precompiling for 13 choices 2025-12-04T09:14:57.6660947Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.6661025Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.6661092Z unimplemented [] 2025-12-04T09:14:57.6661199Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.6661390Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.6662720Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.6662791Z graph_break [] 2025-12-04T09:14:57.6662920Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.6663064Z Autotune Choices Stats: 2025-12-04T09:14:57.6664463Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:57.6664708Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6664929Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6665248Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6666380Z triton_flex_attention_194 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6667490Z triton_flex_attention_195 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6668710Z triton_flex_attention_192 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.6669880Z triton_flex_attention_193 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.6670994Z triton_flex_attention_190 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6672114Z triton_flex_attention_191 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6672393Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3094 seconds precompiling for 6 choices 2025-12-04T09:14:57.6672465Z Autotune Choices Stats: 2025-12-04T09:14:57.6673893Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.6674330Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6674655Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6675226Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6676392Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6677611Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6678792Z triton_flex_attention_backward_197 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6679949Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6681112Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6682328Z triton_flex_attention_backward_200 0.0124 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6683484Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6684627Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6685844Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.6687018Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6687269Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.3930 seconds precompiling for 13 choices 2025-12-04T09:14:57.6687398Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.6687473Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.6687538Z unimplemented [] 2025-12-04T09:14:57.6687647Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.6687838Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.6689023Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.6689124Z graph_break [] 2025-12-04T09:14:57.6689264Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.6689331Z Autotune Choices Stats: 2025-12-04T09:14:57.6690733Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008063999935984612, "best_triton_pos": 0} 2025-12-04T09:14:57.6690976Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6691197Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6691512Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6692678Z triton_flex_attention_213 0.0081 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6693851Z triton_flex_attention_214 0.0082 ms 98.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6694996Z triton_flex_attention_211 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.6696113Z triton_flex_attention_212 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.6697223Z triton_flex_attention_209 0.0112 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6698368Z triton_flex_attention_210 0.0133 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6698617Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.3148 seconds precompiling for 6 choices 2025-12-04T09:14:57.6698688Z Autotune Choices Stats: 2025-12-04T09:14:57.6700110Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.6700551Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6700873Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6701436Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6702659Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6703847Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6704996Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6706152Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6707330Z triton_flex_attention_backward_219 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6708501Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6709659Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6710799Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6712059Z triton_flex_attention_backward_224 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6713207Z triton_flex_attention_backward_227 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.6713457Z SingleProcess AUTOTUNE benchmarking takes 0.6619 seconds and 2.3746 seconds precompiling for 13 choices 2025-12-04T09:14:57.6713585Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.6713658Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.6713719Z unimplemented [] 2025-12-04T09:14:57.6713820Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.6714007Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.6715210Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.6715278Z graph_break [] 2025-12-04T09:14:57.6715405Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.6715472Z Autotune Choices Stats: 2025-12-04T09:14:57.6716866Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.6717110Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6717330Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6717643Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6718788Z triton_flex_attention_232 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6720001Z triton_flex_attention_233 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6721111Z triton_flex_attention_228 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6722226Z triton_flex_attention_230 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.6723338Z triton_flex_attention_231 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.6724491Z triton_flex_attention_229 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6724731Z SingleProcess AUTOTUNE benchmarking takes 0.2900 seconds and 1.2906 seconds precompiling for 6 choices 2025-12-04T09:14:57.6724803Z Autotune Choices Stats: 2025-12-04T09:14:57.6726230Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:57.6726667Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6726991Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6727618Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6728820Z triton_flex_attention_backward_234 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6729980Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6731127Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6732354Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6733514Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6734670Z triton_flex_attention_backward_240 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6735822Z triton_flex_attention_backward_241 0.0134 ms 91.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6737033Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6738218Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6739377Z triton_flex_attention_backward_246 0.0153 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.6739624Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2768 seconds precompiling for 13 choices 2025-12-04T09:14:57.6739812Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.6739882Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.6739942Z unimplemented [] 2025-12-04T09:14:57.6740042Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.6740228Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.6741405Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.6741469Z graph_break [] 2025-12-04T09:14:57.6741591Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.6741657Z Autotune Choices Stats: 2025-12-04T09:14:57.6743044Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.6743284Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6743503Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6743816Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6745034Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6746177Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6747295Z triton_flex_attention_250 0.0111 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.6748402Z triton_flex_attention_247 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6749549Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.6750670Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6750910Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2952 seconds precompiling for 6 choices 2025-12-04T09:14:57.6750980Z Autotune Choices Stats: 2025-12-04T09:14:57.6752404Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_255", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.6752838Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6753300Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6753861Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6755024Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6756180Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6757358Z triton_flex_attention_backward_254 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6758510Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6759655Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6760805Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6762214Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6763427Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6764587Z triton_flex_attention_backward_262 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6765735Z triton_flex_attention_backward_265 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.6766031Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.3139 seconds precompiling for 13 choices 2025-12-04T09:14:57.6766164Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.6766238Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.6766300Z unimplemented [] 2025-12-04T09:14:57.6766402Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.6766589Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.6767757Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.6767823Z graph_break [] 2025-12-04T09:14:57.6767950Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.6768021Z Autotune Choices Stats: 2025-12-04T09:14:57.6769410Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.6769650Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6769873Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6770294Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6771426Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6772577Z triton_flex_attention_271 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6773696Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6774846Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.6775958Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.6777075Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6777318Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T09:14:57.6777385Z Autotune Choices Stats: 2025-12-04T09:14:57.6778893Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:57.6779357Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6779686Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6780249Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6781414Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6782572Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6783760Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6784914Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6786072Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6787290Z triton_flex_attention_backward_278 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6788480Z triton_flex_attention_backward_279 0.0136 ms 90.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6789642Z triton_flex_attention_backward_276 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6790794Z triton_flex_attention_backward_281 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6791981Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.6792229Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.2940 seconds precompiling for 13 choices 2025-12-04T09:14:57.6792356Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.6792428Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.6792490Z unimplemented [] 2025-12-04T09:14:57.6792594Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.6792780Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.6793955Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.6794022Z graph_break [] 2025-12-04T09:14:57.6794147Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.6794211Z Autotune Choices Stats: 2025-12-04T09:14:57.6795659Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.6795932Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6796148Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6796460Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6797596Z triton_flex_attention_289 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6798718Z triton_flex_attention_290 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6799869Z triton_flex_attention_287 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.6800977Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6802092Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.6803208Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6803462Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3003 seconds precompiling for 6 choices 2025-12-04T09:14:57.6803529Z Autotune Choices Stats: 2025-12-04T09:14:57.6805015Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_294", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.012160000391304493, "best_triton_pos": 0} 2025-12-04T09:14:57.6805473Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6805801Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6806362Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6807522Z triton_flex_attention_backward_294 0.0122 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6808731Z triton_flex_attention_backward_291 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6809882Z triton_flex_attention_backward_292 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6811033Z triton_flex_attention_backward_293 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6812180Z triton_flex_attention_backward_296 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6813436Z triton_flex_attention_backward_295 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6814613Z triton_flex_attention_backward_297 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6815766Z triton_flex_attention_backward_298 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6816950Z triton_flex_attention_backward_300 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6818092Z triton_flex_attention_backward_303 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.6818341Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.2523 seconds precompiling for 13 choices 2025-12-04T09:14:57.6818467Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.6818538Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.6818605Z unimplemented [] 2025-12-04T09:14:57.6818716Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.6818908Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.6820077Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.6820146Z graph_break [] 2025-12-04T09:14:57.6820271Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.6820337Z Autotune Choices Stats: 2025-12-04T09:14:57.6821785Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_308", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.6822054Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6822271Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6822582Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6823712Z triton_flex_attention_308 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6824815Z triton_flex_attention_309 0.0082 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6825979Z triton_flex_attention_307 0.0112 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.6827086Z triton_flex_attention_304 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6828201Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.6829321Z triton_flex_attention_305 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6829654Z SingleProcess AUTOTUNE benchmarking takes 0.2924 seconds and 1.4096 seconds precompiling for 6 choices 2025-12-04T09:14:57.6829720Z Autotune Choices Stats: 2025-12-04T09:14:57.6831147Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.6831575Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6831909Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6832469Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6833666Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6834822Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6835980Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6837132Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6838351Z triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6839567Z triton_flex_attention_backward_316 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6840719Z triton_flex_attention_backward_317 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6841871Z triton_flex_attention_backward_314 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6843058Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6844210Z triton_flex_attention_backward_318 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6844464Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.3225 seconds precompiling for 13 choices 2025-12-04T09:14:57.6844636Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:14:57.6844716Z Traceback (most recent call last): 2025-12-04T09:14:57.6845032Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:14:57.6845098Z self.assertTrue( 2025-12-04T09:14:57.6845308Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:14:57.6845393Z raise self.failureException(msg) 2025-12-04T09:14:57.6845635Z AssertionError: False is not true : Log file /tmp/tmpfec1pktd/flex_attention_configs.json was not created 2025-12-04T09:14:57.6845641Z 2025-12-04T09:14:57.6845784Z To execute this test, run the following from the base repo dir: 2025-12-04T09:14:57.6846036Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:14:57.6846136Z 2025-12-04T09:14:57.6846307Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:14:57.6846438Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.6846509Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.6846579Z unimplemented [] 2025-12-04T09:14:57.6846684Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.6847877Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('async_compile_cache_miss', 23), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:14:57.6848067Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.6848125Z graph_break [] 2025-12-04T09:14:57.6848259Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.6849251Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:14:57.6849370Z current_size = base.storage().size() 2025-12-04T09:14:57.6849439Z Autotune Choices Stats: 2025-12-04T09:14:57.6850826Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:57.6851078Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6851294Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6851619Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6852824Z triton_flex_attention_4 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6853937Z triton_flex_attention_5 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6855109Z triton_flex_attention_0 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6856253Z triton_flex_attention_2 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.6857370Z triton_flex_attention_3 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.6858478Z triton_flex_attention_1 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6858767Z SingleProcess AUTOTUNE benchmarking takes 0.2600 seconds and 1.4244 seconds precompiling for 6 choices 2025-12-04T09:14:57.6858834Z Autotune Choices Stats: 2025-12-04T09:14:57.6860257Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:57.6860688Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6861026Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6861589Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6863008Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6864219Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6865376Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6866522Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6867717Z triton_flex_attention_backward_11 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6868868Z triton_flex_attention_backward_12 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6870014Z triton_flex_attention_backward_13 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6871160Z triton_flex_attention_backward_10 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6872377Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6873605Z triton_flex_attention_backward_15 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6873854Z SingleProcess AUTOTUNE benchmarking takes 0.8717 seconds and 2.2315 seconds precompiling for 13 choices 2025-12-04T09:14:57.6873996Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.6874068Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.6874133Z unimplemented [] 2025-12-04T09:14:57.6874235Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.6874416Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.6875594Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.6875688Z graph_break [] 2025-12-04T09:14:57.6875823Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.6875890Z Autotune Choices Stats: 2025-12-04T09:14:57.6877272Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.6877521Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6877738Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6878057Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6879191Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6880363Z triton_flex_attention_24 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6881523Z triton_flex_attention_19 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6882641Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.6883756Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.6884896Z triton_flex_attention_20 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6885148Z SingleProcess AUTOTUNE benchmarking takes 0.2898 seconds and 1.2778 seconds precompiling for 6 choices 2025-12-04T09:14:57.6885215Z Autotune Choices Stats: 2025-12-04T09:14:57.6886637Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.6887070Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6887404Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6887964Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6889189Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6890372Z triton_flex_attention_backward_27 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6891530Z triton_flex_attention_backward_28 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6892751Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6893944Z triton_flex_attention_backward_30 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6895108Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6896253Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6897465Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6898643Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6899796Z triton_flex_attention_backward_37 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.6900043Z SingleProcess AUTOTUNE benchmarking takes 0.6605 seconds and 2.3199 seconds precompiling for 13 choices 2025-12-04T09:14:57.6900186Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.6900260Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.6900323Z unimplemented [] 2025-12-04T09:14:57.6900471Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.6900657Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.6901844Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.6901905Z graph_break [] 2025-12-04T09:14:57.6902045Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.6902120Z Autotune Choices Stats: 2025-12-04T09:14:57.6903508Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.6903761Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6903977Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6904302Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6905518Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6906669Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6907785Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.6908912Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.6910062Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6911172Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6911421Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2616 seconds precompiling for 6 choices 2025-12-04T09:14:57.6911486Z Autotune Choices Stats: 2025-12-04T09:14:57.6912919Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.6913350Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6913685Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6914346Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6915512Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6916665Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6917823Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6919009Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6920173Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6921317Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6922477Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6923716Z triton_flex_attention_backward_50 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6924859Z triton_flex_attention_backward_53 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6926010Z triton_flex_attention_backward_56 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.6926283Z SingleProcess AUTOTUNE benchmarking takes 0.6631 seconds and 2.2635 seconds precompiling for 13 choices 2025-12-04T09:14:57.6926417Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.6926485Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.6926549Z unimplemented [] 2025-12-04T09:14:57.6926658Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.6926845Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.6928015Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.6928075Z graph_break [] 2025-12-04T09:14:57.6928206Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.6928274Z Autotune Choices Stats: 2025-12-04T09:14:57.6929658Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.6929904Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6930119Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6930437Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6931677Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6932851Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6933975Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.6935094Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.6936259Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6937368Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6937618Z SingleProcess AUTOTUNE benchmarking takes 0.2907 seconds and 1.2500 seconds precompiling for 6 choices 2025-12-04T09:14:57.6937685Z Autotune Choices Stats: 2025-12-04T09:14:57.6939108Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.6939608Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6939964Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6940528Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6941688Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6942845Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6944029Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6945180Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6946337Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6947480Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6948691Z triton_flex_attention_backward_69 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6949869Z triton_flex_attention_backward_70 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6951017Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6952171Z triton_flex_attention_backward_75 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.6952450Z SingleProcess AUTOTUNE benchmarking takes 0.6601 seconds and 2.1469 seconds precompiling for 13 choices 2025-12-04T09:14:57.6952585Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.6952653Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.6952714Z unimplemented [] 2025-12-04T09:14:57.6952822Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.6953005Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.6954184Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.6954257Z graph_break [] 2025-12-04T09:14:57.6954388Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.6954461Z Autotune Choices Stats: 2025-12-04T09:14:57.6955839Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.6956085Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6956395Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6956712Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6957831Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6958957Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6960068Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.6961214Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.6962510Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6963641Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6963890Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2576 seconds precompiling for 6 choices 2025-12-04T09:14:57.6963956Z Autotune Choices Stats: 2025-12-04T09:14:57.6965491Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_84", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.6965994Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6966317Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6966887Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6968056Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6969222Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6970424Z triton_flex_attention_backward_83 0.0122 ms 92.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6971572Z triton_flex_attention_backward_82 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6972766Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6973980Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6975172Z triton_flex_attention_backward_88 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6976328Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6977478Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6978669Z triton_flex_attention_backward_91 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6978914Z SingleProcess AUTOTUNE benchmarking takes 0.6599 seconds and 2.2799 seconds precompiling for 13 choices 2025-12-04T09:14:57.6979048Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.6979119Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.6979181Z unimplemented [] 2025-12-04T09:14:57.6979285Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.6979467Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.6980647Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.6980708Z graph_break [] 2025-12-04T09:14:57.6980835Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.6980907Z Autotune Choices Stats: 2025-12-04T09:14:57.6982356Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:57.6982638Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6982853Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6983183Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6984320Z triton_flex_attention_99 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6985440Z triton_flex_attention_100 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6986584Z triton_flex_attention_95 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6987698Z triton_flex_attention_98 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.6988808Z triton_flex_attention_97 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.6989918Z triton_flex_attention_96 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6990162Z SingleProcess AUTOTUNE benchmarking takes 0.2905 seconds and 1.2994 seconds precompiling for 6 choices 2025-12-04T09:14:57.6990226Z Autotune Choices Stats: 2025-12-04T09:14:57.6991709Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_102", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.6992175Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.6992500Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.6993064Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.6994225Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.6995412Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.6996567Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.6997719Z triton_flex_attention_backward_101 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.6998883Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7000170Z triton_flex_attention_backward_107 0.0133 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7001327Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7002479Z triton_flex_attention_backward_108 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7003663Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7004810Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7006877Z SingleProcess AUTOTUNE benchmarking takes 0.6615 seconds and 2.2434 seconds precompiling for 13 choices 2025-12-04T09:14:57.7007057Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.7007138Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.7007208Z unimplemented [] 2025-12-04T09:14:57.7007324Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.7007517Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.7008715Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.7008785Z graph_break [] 2025-12-04T09:14:57.7008942Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.7009012Z Autotune Choices Stats: 2025-12-04T09:14:57.7010503Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.7010788Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7011015Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7011325Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7012525Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7013695Z triton_flex_attention_119 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7014811Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.7015974Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.7017098Z triton_flex_attention_114 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7018252Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7018544Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2587 seconds precompiling for 6 choices 2025-12-04T09:14:57.7018617Z Autotune Choices Stats: 2025-12-04T09:14:57.7020063Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:57.7020500Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7020831Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7021389Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7022597Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7023749Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7024951Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7026105Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7027276Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7028448Z triton_flex_attention_backward_124 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7029589Z triton_flex_attention_backward_126 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7030735Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7031911Z triton_flex_attention_backward_129 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7033100Z triton_flex_attention_backward_132 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.7033355Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.3330 seconds precompiling for 13 choices 2025-12-04T09:14:57.7033489Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.7033558Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.7033626Z unimplemented [] 2025-12-04T09:14:57.7033730Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.7033919Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.7035127Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.7035233Z graph_break [] 2025-12-04T09:14:57.7035369Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.7035434Z Autotune Choices Stats: 2025-12-04T09:14:57.7036830Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:57.7037077Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7037297Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7037610Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7038741Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7039885Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7041026Z triton_flex_attention_136 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.7042130Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.7043233Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7044396Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7044648Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3488 seconds precompiling for 6 choices 2025-12-04T09:14:57.7044716Z Autotune Choices Stats: 2025-12-04T09:14:57.7046141Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_140", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.7046573Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7046939Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7047508Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7048679Z triton_flex_attention_backward_140 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7049861Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7051003Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7052175Z triton_flex_attention_backward_142 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7053415Z triton_flex_attention_backward_144 0.0131 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7054562Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7055701Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7056883Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7058062Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7059210Z triton_flex_attention_backward_149 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7059458Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.2392 seconds precompiling for 13 choices 2025-12-04T09:14:57.7059588Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.7059657Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.7059724Z unimplemented [] 2025-12-04T09:14:57.7059825Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.7060017Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.7061255Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.7061316Z graph_break [] 2025-12-04T09:14:57.7061452Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.7061519Z Autotune Choices Stats: 2025-12-04T09:14:57.7063055Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.7063303Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7063524Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7063930Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7065067Z triton_flex_attention_156 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7066218Z triton_flex_attention_157 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7067328Z triton_flex_attention_152 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7068429Z triton_flex_attention_154 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.7069588Z triton_flex_attention_155 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.7070741Z triton_flex_attention_153 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7070991Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3309 seconds precompiling for 6 choices 2025-12-04T09:14:57.7071061Z Autotune Choices Stats: 2025-12-04T09:14:57.7072482Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_160", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.7072952Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7073281Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7073839Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7075032Z triton_flex_attention_backward_160 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7076182Z triton_flex_attention_backward_158 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7077357Z triton_flex_attention_backward_159 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7078625Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7079769Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7080907Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7082078Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7083220Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7084399Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7085539Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.7085789Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.1546 seconds precompiling for 13 choices 2025-12-04T09:14:57.7085948Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.7086046Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.7086124Z unimplemented [] 2025-12-04T09:14:57.7086229Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.7086412Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.7087590Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.7087652Z graph_break [] 2025-12-04T09:14:57.7087784Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.7087853Z Autotune Choices Stats: 2025-12-04T09:14:57.7089233Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.7089515Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7089736Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7090065Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7091266Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7092506Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7093626Z triton_flex_attention_171 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7094762Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.7095916Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.7097017Z triton_flex_attention_172 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7097264Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3199 seconds precompiling for 6 choices 2025-12-04T09:14:57.7097331Z Autotune Choices Stats: 2025-12-04T09:14:57.7098757Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_178", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.7099223Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7099551Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7100144Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7101415Z triton_flex_attention_backward_178 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7102614Z triton_flex_attention_backward_177 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7103809Z triton_flex_attention_backward_180 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7104995Z triton_flex_attention_backward_179 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7106141Z triton_flex_attention_backward_182 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7107327Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7108473Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7109655Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7110866Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7112050Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.7112347Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3449 seconds precompiling for 13 choices 2025-12-04T09:14:57.7112478Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.7112547Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.7112616Z unimplemented [] 2025-12-04T09:14:57.7112720Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.7112904Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.7114095Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.7114157Z graph_break [] 2025-12-04T09:14:57.7114289Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.7114355Z Autotune Choices Stats: 2025-12-04T09:14:57.7115749Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:57.7116032Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7116253Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7116578Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7117742Z triton_flex_attention_194 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7118860Z triton_flex_attention_195 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7120074Z triton_flex_attention_192 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.7121383Z triton_flex_attention_193 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.7122627Z triton_flex_attention_190 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7123740Z triton_flex_attention_191 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7124035Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3094 seconds precompiling for 6 choices 2025-12-04T09:14:57.7124103Z Autotune Choices Stats: 2025-12-04T09:14:57.7125531Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.7125968Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7126339Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7126899Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7128073Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7129282Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7130469Z triton_flex_attention_backward_197 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7131623Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7132829Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7134016Z triton_flex_attention_backward_200 0.0124 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7135197Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7136350Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7137494Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.7138707Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7138957Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.3930 seconds precompiling for 13 choices 2025-12-04T09:14:57.7139088Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.7139157Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.7139228Z unimplemented [] 2025-12-04T09:14:57.7139326Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.7139514Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.7140694Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.7140789Z graph_break [] 2025-12-04T09:14:57.7140918Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.7140985Z Autotune Choices Stats: 2025-12-04T09:14:57.7142382Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008063999935984612, "best_triton_pos": 0} 2025-12-04T09:14:57.7142626Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7142847Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7143240Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7144365Z triton_flex_attention_213 0.0081 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7145475Z triton_flex_attention_214 0.0082 ms 98.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7146658Z triton_flex_attention_211 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.7147769Z triton_flex_attention_212 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.7148892Z triton_flex_attention_209 0.0112 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7150005Z triton_flex_attention_210 0.0133 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7150283Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.3148 seconds precompiling for 6 choices 2025-12-04T09:14:57.7150349Z Autotune Choices Stats: 2025-12-04T09:14:57.7151822Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.7152260Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7152595Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7153153Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7154358Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7155536Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7156690Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7157844Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7159044Z triton_flex_attention_backward_219 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7160249Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7161396Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7162727Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7163951Z triton_flex_attention_backward_224 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7165162Z triton_flex_attention_backward_227 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.7165423Z SingleProcess AUTOTUNE benchmarking takes 0.6619 seconds and 2.3746 seconds precompiling for 13 choices 2025-12-04T09:14:57.7165555Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.7165624Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.7165691Z unimplemented [] 2025-12-04T09:14:57.7165794Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.7165979Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.7167155Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.7167274Z graph_break [] 2025-12-04T09:14:57.7167399Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.7167463Z Autotune Choices Stats: 2025-12-04T09:14:57.7168904Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.7169151Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7169374Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7169686Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7170816Z triton_flex_attention_232 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7171958Z triton_flex_attention_233 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7173168Z triton_flex_attention_228 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7174280Z triton_flex_attention_230 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.7175400Z triton_flex_attention_231 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.7176547Z triton_flex_attention_229 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7176789Z SingleProcess AUTOTUNE benchmarking takes 0.2900 seconds and 1.2906 seconds precompiling for 6 choices 2025-12-04T09:14:57.7176857Z Autotune Choices Stats: 2025-12-04T09:14:57.7178308Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:57.7178746Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7179072Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7179631Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7180859Z triton_flex_attention_backward_234 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7182017Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7183172Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7184371Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7185512Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7186693Z triton_flex_attention_backward_240 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7187833Z triton_flex_attention_backward_241 0.0134 ms 91.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7189025Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7190202Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7191345Z triton_flex_attention_backward_246 0.0153 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.7191594Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2768 seconds precompiling for 13 choices 2025-12-04T09:14:57.7191720Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.7191838Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.7191904Z unimplemented [] 2025-12-04T09:14:57.7192004Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.7192199Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.7193385Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.7193452Z graph_break [] 2025-12-04T09:14:57.7193579Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.7193646Z Autotune Choices Stats: 2025-12-04T09:14:57.7195073Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.7195317Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7195537Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7195854Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7197082Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7198225Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7199354Z triton_flex_attention_250 0.0111 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.7200460Z triton_flex_attention_247 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7201606Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.7202712Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7202991Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2952 seconds precompiling for 6 choices 2025-12-04T09:14:57.7203058Z Autotune Choices Stats: 2025-12-04T09:14:57.7204486Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_255", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.7204922Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7205285Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7205873Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7207041Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7208192Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7209355Z triton_flex_attention_backward_254 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7210542Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7211717Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7212908Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7214090Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7215281Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7216438Z triton_flex_attention_backward_262 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7217584Z triton_flex_attention_backward_265 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.7217865Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.3139 seconds precompiling for 13 choices 2025-12-04T09:14:57.7217993Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.7218062Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.7218132Z unimplemented [] 2025-12-04T09:14:57.7218232Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.7218419Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.7219598Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.7219697Z graph_break [] 2025-12-04T09:14:57.7219832Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.7219900Z Autotune Choices Stats: 2025-12-04T09:14:57.7221284Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.7221524Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7221745Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7222094Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7223281Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7224404Z triton_flex_attention_271 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7225520Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7226662Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.7227783Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.7228928Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7229177Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T09:14:57.7229242Z Autotune Choices Stats: 2025-12-04T09:14:57.7230669Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:57.7231166Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7231494Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7232051Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7233220Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7234369Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7235552Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7236735Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7237884Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7239041Z triton_flex_attention_backward_278 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7240248Z triton_flex_attention_backward_279 0.0136 ms 90.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7241398Z triton_flex_attention_backward_276 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7242542Z triton_flex_attention_backward_281 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7243717Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.7243963Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.2940 seconds precompiling for 13 choices 2025-12-04T09:14:57.7244088Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.7244153Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.7244218Z unimplemented [] 2025-12-04T09:14:57.7244315Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.7244508Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.7245719Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.7245782Z graph_break [] 2025-12-04T09:14:57.7245913Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.7245977Z Autotune Choices Stats: 2025-12-04T09:14:57.7247367Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.7247674Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7247896Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7248211Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7249357Z triton_flex_attention_289 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7250470Z triton_flex_attention_290 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7251588Z triton_flex_attention_287 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.7252778Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7254144Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.7255257Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7255506Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3003 seconds precompiling for 6 choices 2025-12-04T09:14:57.7255572Z Autotune Choices Stats: 2025-12-04T09:14:57.7257039Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_294", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.012160000391304493, "best_triton_pos": 0} 2025-12-04T09:14:57.7257497Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7257829Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7258390Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7259557Z triton_flex_attention_backward_294 0.0122 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7260740Z triton_flex_attention_backward_291 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7262052Z triton_flex_attention_backward_292 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7263263Z triton_flex_attention_backward_293 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7264417Z triton_flex_attention_backward_296 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7265616Z triton_flex_attention_backward_295 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7266824Z triton_flex_attention_backward_297 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7267976Z triton_flex_attention_backward_298 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7269128Z triton_flex_attention_backward_300 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7270319Z triton_flex_attention_backward_303 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.7270568Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.2523 seconds precompiling for 13 choices 2025-12-04T09:14:57.7270698Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.7270764Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.7270863Z unimplemented [] 2025-12-04T09:14:57.7270963Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.7271145Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.7272332Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.7272395Z graph_break [] 2025-12-04T09:14:57.7272532Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.7272598Z Autotune Choices Stats: 2025-12-04T09:14:57.7274027Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_308", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.7274299Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7274519Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7274838Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7275968Z triton_flex_attention_308 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7277080Z triton_flex_attention_309 0.0082 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7278274Z triton_flex_attention_307 0.0112 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.7279413Z triton_flex_attention_304 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7280527Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.7281638Z triton_flex_attention_305 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7281888Z SingleProcess AUTOTUNE benchmarking takes 0.2924 seconds and 1.4096 seconds precompiling for 6 choices 2025-12-04T09:14:57.7282019Z Autotune Choices Stats: 2025-12-04T09:14:57.7283444Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.7283876Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7284209Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7284767Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7285965Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7287118Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7288334Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7289488Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7290663Z triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7291844Z triton_flex_attention_backward_316 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7293094Z triton_flex_attention_backward_317 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7294247Z triton_flex_attention_backward_314 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7295430Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7296610Z triton_flex_attention_backward_318 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7296858Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.3225 seconds precompiling for 13 choices 2025-12-04T09:14:57.7296987Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.7297054Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.7297118Z unimplemented [] 2025-12-04T09:14:57.7297216Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.7297395Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.7298572Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.7298632Z graph_break [] 2025-12-04T09:14:57.7298830Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.7298899Z Autotune Choices Stats: 2025-12-04T09:14:57.7300288Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.007968000136315823, "best_triton_pos": 0} 2025-12-04T09:14:57.7300531Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7300751Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7301067Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7302188Z triton_flex_attention_327 0.0080 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7303347Z triton_flex_attention_328 0.0082 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7304461Z triton_flex_attention_325 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.7305609Z triton_flex_attention_326 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.7306720Z triton_flex_attention_323 0.0113 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7307863Z triton_flex_attention_324 0.0133 ms 59.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7308141Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T09:14:57.7308206Z Autotune Choices Stats: 2025-12-04T09:14:57.7309636Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.7310068Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7310396Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7310986Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7312154Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7313337Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7314491Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7315642Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7316850Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7318002Z triton_flex_attention_backward_336 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7319145Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7320342Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7321484Z triton_flex_attention_backward_338 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7322660Z triton_flex_attention_backward_341 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.7322908Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3668 seconds precompiling for 13 choices 2025-12-04T09:14:57.7323080Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:14:57.7323157Z Traceback (most recent call last): 2025-12-04T09:14:57.7323463Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:14:57.7323529Z self.assertTrue( 2025-12-04T09:14:57.7323739Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:14:57.7323829Z raise self.failureException(msg) 2025-12-04T09:14:57.7324067Z AssertionError: False is not true : Log file /tmp/tmp_por9z0t/flex_attention_configs.json was not created 2025-12-04T09:14:57.7324137Z 2025-12-04T09:14:57.7324278Z To execute this test, run the following from the base repo dir: 2025-12-04T09:14:57.7324531Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:14:57.7324536Z 2025-12-04T09:14:57.7324696Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:14:57.7324830Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.7324903Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.7324971Z unimplemented [] 2025-12-04T09:14:57.7325075Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.7326267Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('async_compile_cache_miss', 23), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:14:57.7326459Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.7326519Z graph_break [] 2025-12-04T09:14:57.7326651Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.7327639Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:14:57.7327766Z current_size = base.storage().size() 2025-12-04T09:14:57.7327841Z Autotune Choices Stats: 2025-12-04T09:14:57.7329242Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:57.7329496Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7329746Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7330071Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7331193Z triton_flex_attention_4 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7332381Z triton_flex_attention_5 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7333527Z triton_flex_attention_0 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7334647Z triton_flex_attention_2 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.7335758Z triton_flex_attention_3 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.7336907Z triton_flex_attention_1 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7337159Z SingleProcess AUTOTUNE benchmarking takes 0.2600 seconds and 1.4244 seconds precompiling for 6 choices 2025-12-04T09:14:57.7337228Z Autotune Choices Stats: 2025-12-04T09:14:57.7338681Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:57.7339122Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7339448Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7340014Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7341204Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7342402Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7343548Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7344689Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7345873Z triton_flex_attention_backward_11 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7347044Z triton_flex_attention_backward_12 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7348198Z triton_flex_attention_backward_13 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7349339Z triton_flex_attention_backward_10 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7350561Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7351712Z triton_flex_attention_backward_15 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7351960Z SingleProcess AUTOTUNE benchmarking takes 0.8717 seconds and 2.2315 seconds precompiling for 13 choices 2025-12-04T09:14:57.7352095Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.7352165Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.7352226Z unimplemented [] 2025-12-04T09:14:57.7352335Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.7352550Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.7353737Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.7353801Z graph_break [] 2025-12-04T09:14:57.7353929Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.7354004Z Autotune Choices Stats: 2025-12-04T09:14:57.7355414Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.7355668Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7355883Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7356199Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7357318Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7358487Z triton_flex_attention_24 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7359599Z triton_flex_attention_19 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7360716Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.7362013Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.7363150Z triton_flex_attention_20 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7363396Z SingleProcess AUTOTUNE benchmarking takes 0.2898 seconds and 1.2778 seconds precompiling for 6 choices 2025-12-04T09:14:57.7363464Z Autotune Choices Stats: 2025-12-04T09:14:57.7364945Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.7365398Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7365727Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7366340Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7367545Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7368707Z triton_flex_attention_backward_27 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7369854Z triton_flex_attention_backward_28 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7371041Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7372188Z triton_flex_attention_backward_30 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7373429Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7374577Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7375759Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7376932Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7378081Z triton_flex_attention_backward_37 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.7378329Z SingleProcess AUTOTUNE benchmarking takes 0.6605 seconds and 2.3199 seconds precompiling for 13 choices 2025-12-04T09:14:57.7378496Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.7378573Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.7378638Z unimplemented [] 2025-12-04T09:14:57.7378747Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.7378932Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.7380115Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.7380177Z graph_break [] 2025-12-04T09:14:57.7380304Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.7380378Z Autotune Choices Stats: 2025-12-04T09:14:57.7381801Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.7382053Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7382267Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7382590Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7383765Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7384913Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7386026Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.7387143Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.7388282Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7389425Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7389668Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2616 seconds precompiling for 6 choices 2025-12-04T09:14:57.7393155Z Autotune Choices Stats: 2025-12-04T09:14:57.7394660Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.7395119Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7395575Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7396144Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7397322Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7398486Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7399675Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7400832Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7402043Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7403195Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7404376Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7405542Z triton_flex_attention_backward_50 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7406693Z triton_flex_attention_backward_53 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7407838Z triton_flex_attention_backward_56 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.7408123Z SingleProcess AUTOTUNE benchmarking takes 0.6631 seconds and 2.2635 seconds precompiling for 13 choices 2025-12-04T09:14:57.7408274Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.7408349Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.7408413Z unimplemented [] 2025-12-04T09:14:57.7408534Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.7408722Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.7409937Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.7410008Z graph_break [] 2025-12-04T09:14:57.7410138Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.7410219Z Autotune Choices Stats: 2025-12-04T09:14:57.7411711Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.7411974Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7412196Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7412726Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7413855Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7414964Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7416065Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.7417202Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.7418299Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7419456Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7419707Z SingleProcess AUTOTUNE benchmarking takes 0.2907 seconds and 1.2500 seconds precompiling for 6 choices 2025-12-04T09:14:57.7419781Z Autotune Choices Stats: 2025-12-04T09:14:57.7421228Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.7421702Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7422032Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7422600Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7423753Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7424903Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7426072Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7427260Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7428409Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7429571Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7430749Z triton_flex_attention_backward_69 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7431887Z triton_flex_attention_backward_70 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7433028Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7434209Z triton_flex_attention_backward_75 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.7434455Z SingleProcess AUTOTUNE benchmarking takes 0.6601 seconds and 2.1469 seconds precompiling for 13 choices 2025-12-04T09:14:57.7434592Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.7434665Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.7434727Z unimplemented [] 2025-12-04T09:14:57.7434840Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.7435023Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.7436238Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.7436311Z graph_break [] 2025-12-04T09:14:57.7436438Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.7436509Z Autotune Choices Stats: 2025-12-04T09:14:57.7437923Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.7438204Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7438423Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7438739Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7439863Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7440972Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7442114Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.7443232Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.7444368Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7445487Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7445729Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2576 seconds precompiling for 6 choices 2025-12-04T09:14:57.7445804Z Autotune Choices Stats: 2025-12-04T09:14:57.7447259Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_84", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.7447745Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7448077Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7448647Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7449810Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7451013Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7452184Z triton_flex_attention_backward_83 0.0122 ms 92.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7453392Z triton_flex_attention_backward_82 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7454534Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7455705Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7456876Z triton_flex_attention_backward_88 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7458009Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7459162Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7460344Z triton_flex_attention_backward_91 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7460592Z SingleProcess AUTOTUNE benchmarking takes 0.6599 seconds and 2.2799 seconds precompiling for 13 choices 2025-12-04T09:14:57.7460726Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.7460844Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.7460911Z unimplemented [] 2025-12-04T09:14:57.7461023Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.7461208Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.7462565Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.7462639Z graph_break [] 2025-12-04T09:14:57.7462769Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.7462841Z Autotune Choices Stats: 2025-12-04T09:14:57.7464304Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:57.7464609Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7464833Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7465163Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7466301Z triton_flex_attention_99 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7467424Z triton_flex_attention_100 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7468576Z triton_flex_attention_95 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7469740Z triton_flex_attention_98 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.7470852Z triton_flex_attention_97 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.7471964Z triton_flex_attention_96 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7472269Z SingleProcess AUTOTUNE benchmarking takes 0.2905 seconds and 1.2994 seconds precompiling for 6 choices 2025-12-04T09:14:57.7472343Z Autotune Choices Stats: 2025-12-04T09:14:57.7473769Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_102", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.7474204Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7474532Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7475091Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7476284Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7477438Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7478623Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7479777Z triton_flex_attention_backward_101 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7480960Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7482151Z triton_flex_attention_backward_107 0.0133 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7483302Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7484445Z triton_flex_attention_backward_108 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7485622Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7486797Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7487047Z SingleProcess AUTOTUNE benchmarking takes 0.6615 seconds and 2.2434 seconds precompiling for 13 choices 2025-12-04T09:14:57.7487178Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.7487255Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.7487317Z unimplemented [] 2025-12-04T09:14:57.7487423Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.7487608Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.7488789Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.7488889Z graph_break [] 2025-12-04T09:14:57.7489047Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.7489119Z Autotune Choices Stats: 2025-12-04T09:14:57.7490507Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.7490755Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7490973Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7491286Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7492458Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7493607Z triton_flex_attention_119 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7494749Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.7495867Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.7496976Z triton_flex_attention_114 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7498117Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7498391Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2587 seconds precompiling for 6 choices 2025-12-04T09:14:57.7498463Z Autotune Choices Stats: 2025-12-04T09:14:57.7499897Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:57.7500346Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7500676Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7501274Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7502435Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7503623Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7504772Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7505957Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7507143Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7508309Z triton_flex_attention_backward_124 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7509467Z triton_flex_attention_backward_126 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7510645Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7511801Z triton_flex_attention_backward_129 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7512985Z triton_flex_attention_backward_132 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.7513231Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.3330 seconds precompiling for 13 choices 2025-12-04T09:14:57.7513362Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.7513439Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.7513501Z unimplemented [] 2025-12-04T09:14:57.7513605Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.7513794Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.7515013Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.7515111Z graph_break [] 2025-12-04T09:14:57.7515238Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.7515313Z Autotune Choices Stats: 2025-12-04T09:14:57.7516704Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:57.7516952Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7517167Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7517511Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7518640Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7519748Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7520893Z triton_flex_attention_136 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.7522009Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.7523160Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7524303Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7524550Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3488 seconds precompiling for 6 choices 2025-12-04T09:14:57.7524623Z Autotune Choices Stats: 2025-12-04T09:14:57.7526039Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_140", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.7526512Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7526836Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7527398Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7528590Z triton_flex_attention_backward_140 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7529751Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7530902Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7532122Z triton_flex_attention_backward_142 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7533378Z triton_flex_attention_backward_144 0.0131 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7534525Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7535721Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7536859Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7538039Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7539189Z triton_flex_attention_backward_149 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7539436Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.2392 seconds precompiling for 13 choices 2025-12-04T09:14:57.7539564Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.7539636Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.7539731Z unimplemented [] 2025-12-04T09:14:57.7539873Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.7540064Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.7541235Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.7541302Z graph_break [] 2025-12-04T09:14:57.7541430Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.7541502Z Autotune Choices Stats: 2025-12-04T09:14:57.7542889Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.7543167Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7543381Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7543693Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7544824Z triton_flex_attention_156 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7545978Z triton_flex_attention_157 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7547090Z triton_flex_attention_152 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7548301Z triton_flex_attention_154 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.7549449Z triton_flex_attention_155 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.7550566Z triton_flex_attention_153 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7550807Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3309 seconds precompiling for 6 choices 2025-12-04T09:14:57.7550881Z Autotune Choices Stats: 2025-12-04T09:14:57.7552299Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_160", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.7552775Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7553096Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7553664Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7554853Z triton_flex_attention_backward_160 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7556014Z triton_flex_attention_backward_158 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7557194Z triton_flex_attention_backward_159 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7558373Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7559531Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7560676Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7562004Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7563223Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7564390Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7565599Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.7565894Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.1546 seconds precompiling for 13 choices 2025-12-04T09:14:57.7566024Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.7566098Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.7566163Z unimplemented [] 2025-12-04T09:14:57.7566265Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.7566455Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.7567626Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.7567695Z graph_break [] 2025-12-04T09:14:57.7567821Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.7567891Z Autotune Choices Stats: 2025-12-04T09:14:57.7569281Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.7569579Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7569795Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7570108Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7571263Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7572419Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7573531Z triton_flex_attention_171 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7574710Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.7575818Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.7576933Z triton_flex_attention_172 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7577222Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3199 seconds precompiling for 6 choices 2025-12-04T09:14:57.7577301Z Autotune Choices Stats: 2025-12-04T09:14:57.7578719Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_178", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.7579156Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7579517Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7580080Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7581238Z triton_flex_attention_backward_178 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7582426Z triton_flex_attention_backward_177 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7583614Z triton_flex_attention_backward_180 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7584769Z triton_flex_attention_backward_179 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7585917Z triton_flex_attention_backward_182 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7587105Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7588289Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7589437Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7590592Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7591773Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.7592058Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3449 seconds precompiling for 13 choices 2025-12-04T09:14:57.7592187Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.7592260Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.7592323Z unimplemented [] 2025-12-04T09:14:57.7592425Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.7592624Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.7593803Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.7593869Z graph_break [] 2025-12-04T09:14:57.7594032Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.7594098Z Autotune Choices Stats: 2025-12-04T09:14:57.7595487Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:57.7595728Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7595953Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7596300Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7597434Z triton_flex_attention_194 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7598545Z triton_flex_attention_195 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7599691Z triton_flex_attention_192 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.7600835Z triton_flex_attention_193 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.7601939Z triton_flex_attention_190 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7603055Z triton_flex_attention_191 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7603326Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3094 seconds precompiling for 6 choices 2025-12-04T09:14:57.7603400Z Autotune Choices Stats: 2025-12-04T09:14:57.7604821Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.7605290Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7605619Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7606180Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7607338Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7608585Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7609735Z triton_flex_attention_backward_197 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7610884Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7612060Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7613262Z triton_flex_attention_backward_200 0.0124 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7614446Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7615592Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7616772Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.7617944Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7618192Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.3930 seconds precompiling for 13 choices 2025-12-04T09:14:57.7618321Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.7618394Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.7618456Z unimplemented [] 2025-12-04T09:14:57.7618559Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.7618744Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.7619920Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.7620019Z graph_break [] 2025-12-04T09:14:57.7620142Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.7620210Z Autotune Choices Stats: 2025-12-04T09:14:57.7621608Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008063999935984612, "best_triton_pos": 0} 2025-12-04T09:14:57.7621884Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7622105Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7622423Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7623557Z triton_flex_attention_213 0.0081 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7624698Z triton_flex_attention_214 0.0082 ms 98.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7625846Z triton_flex_attention_211 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.7626962Z triton_flex_attention_212 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.7628073Z triton_flex_attention_209 0.0112 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7629223Z triton_flex_attention_210 0.0133 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7629466Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.3148 seconds precompiling for 6 choices 2025-12-04T09:14:57.7629538Z Autotune Choices Stats: 2025-12-04T09:14:57.7631004Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.7631445Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7631766Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7632340Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7633544Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7634730Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7635878Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7637034Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7638215Z triton_flex_attention_backward_219 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7639394Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7640548Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7641996Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7643191Z triton_flex_attention_backward_224 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7644335Z triton_flex_attention_backward_227 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.7644598Z SingleProcess AUTOTUNE benchmarking takes 0.6619 seconds and 2.3746 seconds precompiling for 13 choices 2025-12-04T09:14:57.7644731Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.7644805Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.7644870Z unimplemented [] 2025-12-04T09:14:57.7645011Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.7645204Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.7646374Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.7646442Z graph_break [] 2025-12-04T09:14:57.7646570Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.7646637Z Autotune Choices Stats: 2025-12-04T09:14:57.7648069Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.7648315Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7648536Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7648862Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7650023Z triton_flex_attention_232 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7651161Z triton_flex_attention_233 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7652332Z triton_flex_attention_228 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7653441Z triton_flex_attention_230 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.7654589Z triton_flex_attention_231 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.7655699Z triton_flex_attention_229 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7655952Z SingleProcess AUTOTUNE benchmarking takes 0.2900 seconds and 1.2906 seconds precompiling for 6 choices 2025-12-04T09:14:57.7656063Z Autotune Choices Stats: 2025-12-04T09:14:57.7657485Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:57.7657923Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7658250Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7658881Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7660050Z triton_flex_attention_backward_234 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7661210Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7662526Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7663788Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7664989Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7666158Z triton_flex_attention_backward_240 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7667318Z triton_flex_attention_backward_241 0.0134 ms 91.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7668557Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7669716Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7670856Z triton_flex_attention_backward_246 0.0153 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.7671140Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2768 seconds precompiling for 13 choices 2025-12-04T09:14:57.7671275Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.7671354Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.7671415Z unimplemented [] 2025-12-04T09:14:57.7671520Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.7671717Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.7672892Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.7672959Z graph_break [] 2025-12-04T09:14:57.7673097Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.7673221Z Autotune Choices Stats: 2025-12-04T09:14:57.7674631Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.7674879Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7675107Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7675427Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7676638Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7677755Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7678889Z triton_flex_attention_250 0.0111 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.7680000Z triton_flex_attention_247 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7681153Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.7682343Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7682592Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2952 seconds precompiling for 6 choices 2025-12-04T09:14:57.7682659Z Autotune Choices Stats: 2025-12-04T09:14:57.7684084Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_255", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.7684553Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7684908Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7685468Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7686640Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7687794Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7688976Z triton_flex_attention_backward_254 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7690138Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7691319Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7692537Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7693738Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7694914Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7696070Z triton_flex_attention_backward_262 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7697218Z triton_flex_attention_backward_265 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.7697503Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.3139 seconds precompiling for 13 choices 2025-12-04T09:14:57.7697635Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.7697712Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.7697777Z unimplemented [] 2025-12-04T09:14:57.7697881Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.7698070Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.7699296Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.7699368Z graph_break [] 2025-12-04T09:14:57.7699506Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.7699573Z Autotune Choices Stats: 2025-12-04T09:14:57.7700968Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.7701212Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7701499Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7701812Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7702940Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7704053Z triton_flex_attention_271 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7705169Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7706330Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.7707474Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.7708594Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7708837Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T09:14:57.7708906Z Autotune Choices Stats: 2025-12-04T09:14:57.7710367Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:57.7710834Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7711156Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7711716Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7712886Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7714073Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7715224Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7716410Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7717574Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7718767Z triton_flex_attention_backward_278 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7719949Z triton_flex_attention_backward_279 0.0136 ms 90.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7721100Z triton_flex_attention_backward_276 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7722255Z triton_flex_attention_backward_281 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7723440Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.7723691Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.2940 seconds precompiling for 13 choices 2025-12-04T09:14:57.7723822Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.7723897Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.7723958Z unimplemented [] 2025-12-04T09:14:57.7724058Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.7724276Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.7725452Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.7725517Z graph_break [] 2025-12-04T09:14:57.7725645Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.7725711Z Autotune Choices Stats: 2025-12-04T09:14:57.7727135Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.7727405Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7727628Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7727942Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7729070Z triton_flex_attention_289 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7730181Z triton_flex_attention_290 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7731347Z triton_flex_attention_287 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.7732511Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7733670Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.7734786Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7735029Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3003 seconds precompiling for 6 choices 2025-12-04T09:14:57.7735096Z Autotune Choices Stats: 2025-12-04T09:14:57.7736612Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_294", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.012160000391304493, "best_triton_pos": 0} 2025-12-04T09:14:57.7737046Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7737370Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7737930Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7739114Z triton_flex_attention_backward_294 0.0122 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7740304Z triton_flex_attention_backward_291 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7741483Z triton_flex_attention_backward_292 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7742635Z triton_flex_attention_backward_293 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7743814Z triton_flex_attention_backward_296 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7744996Z triton_flex_attention_backward_295 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7746142Z triton_flex_attention_backward_297 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7747295Z triton_flex_attention_backward_298 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7748488Z triton_flex_attention_backward_300 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7749636Z triton_flex_attention_backward_303 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.7749919Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.2523 seconds precompiling for 13 choices 2025-12-04T09:14:57.7750052Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.7750127Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.7750190Z unimplemented [] 2025-12-04T09:14:57.7750301Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.7750493Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.7751666Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.7751735Z graph_break [] 2025-12-04T09:14:57.7751863Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.7751929Z Autotune Choices Stats: 2025-12-04T09:14:57.7753398Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_308", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.7753642Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7753862Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7754177Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7755304Z triton_flex_attention_308 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7756446Z triton_flex_attention_309 0.0082 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7757565Z triton_flex_attention_307 0.0112 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.7758712Z triton_flex_attention_304 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7759831Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.7760982Z triton_flex_attention_305 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7761256Z SingleProcess AUTOTUNE benchmarking takes 0.2924 seconds and 1.4096 seconds precompiling for 6 choices 2025-12-04T09:14:57.7761322Z Autotune Choices Stats: 2025-12-04T09:14:57.7762995Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.7763439Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7763772Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7764406Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7765574Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7766774Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7767933Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7769094Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7770307Z triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7771516Z triton_flex_attention_backward_316 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7772743Z triton_flex_attention_backward_317 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7773902Z triton_flex_attention_backward_314 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7775094Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7776282Z triton_flex_attention_backward_318 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7776534Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.3225 seconds precompiling for 13 choices 2025-12-04T09:14:57.7776664Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.7776733Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.7776804Z unimplemented [] 2025-12-04T09:14:57.7776904Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.7777097Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.7778312Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.7778413Z graph_break [] 2025-12-04T09:14:57.7778540Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.7778610Z Autotune Choices Stats: 2025-12-04T09:14:57.7780017Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.007968000136315823, "best_triton_pos": 0} 2025-12-04T09:14:57.7780262Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7780488Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7780801Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7781926Z triton_flex_attention_327 0.0080 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7783076Z triton_flex_attention_328 0.0082 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7784231Z triton_flex_attention_325 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.7785351Z triton_flex_attention_326 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.7786462Z triton_flex_attention_323 0.0113 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7787638Z triton_flex_attention_324 0.0133 ms 59.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7787885Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T09:14:57.7787951Z Autotune Choices Stats: 2025-12-04T09:14:57.7789384Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.7789815Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7790182Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7790738Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7791902Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7793089Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7794243Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7795434Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7796617Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7797771Z triton_flex_attention_backward_336 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7798921Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7800113Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7801333Z triton_flex_attention_backward_338 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7802484Z triton_flex_attention_backward_341 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.7802732Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3668 seconds precompiling for 13 choices 2025-12-04T09:14:57.7802862Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.7802930Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.7803004Z unimplemented [] 2025-12-04T09:14:57.7803104Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.7803293Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.7804534Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.7804599Z graph_break [] 2025-12-04T09:14:57.7804731Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.7804797Z Autotune Choices Stats: 2025-12-04T09:14:57.7806187Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.7806428Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7806687Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7807002Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7808136Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7809284Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7810401Z triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7811514Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.7812720Z triton_flex_attention_345 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.7813880Z triton_flex_attention_343 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7814131Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2820 seconds precompiling for 6 choices 2025-12-04T09:14:57.7814200Z Autotune Choices Stats: 2025-12-04T09:14:57.7815624Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_349", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.7816141Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7816472Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7817031Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7818229Z triton_flex_attention_backward_349 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7819387Z triton_flex_attention_backward_348 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7820574Z triton_flex_attention_backward_350 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7821752Z triton_flex_attention_backward_351 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7822900Z triton_flex_attention_backward_353 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7824053Z triton_flex_attention_backward_355 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7825230Z triton_flex_attention_backward_352 0.0143 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7826382Z triton_flex_attention_backward_354 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7827569Z triton_flex_attention_backward_357 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7828717Z triton_flex_attention_backward_360 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.7828970Z SingleProcess AUTOTUNE benchmarking takes 0.6637 seconds and 2.2959 seconds precompiling for 13 choices 2025-12-04T09:14:57.7829205Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:14:57.7829284Z Traceback (most recent call last): 2025-12-04T09:14:57.7829585Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:14:57.7829650Z self.assertTrue( 2025-12-04T09:14:57.7829854Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:14:57.7829936Z raise self.failureException(msg) 2025-12-04T09:14:57.7830190Z AssertionError: False is not true : Log file /tmp/tmp2gny9fyn/flex_attention_configs.json was not created 2025-12-04T09:14:57.7830194Z 2025-12-04T09:14:57.7830340Z To execute this test, run the following from the base repo dir: 2025-12-04T09:14:57.7830597Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:14:57.7830601Z 2025-12-04T09:14:57.7830773Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:14:57.7830905Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.7830976Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.7831044Z unimplemented [] 2025-12-04T09:14:57.7831149Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.7832342Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('async_compile_cache_miss', 23), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:14:57.7832590Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.7832651Z graph_break [] 2025-12-04T09:14:57.7832785Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.7833776Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:14:57.7833873Z current_size = base.storage().size() 2025-12-04T09:14:57.7833943Z Autotune Choices Stats: 2025-12-04T09:14:57.7835379Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:57.7835634Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7835856Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7836180Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7837335Z triton_flex_attention_4 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7838473Z triton_flex_attention_5 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7839593Z triton_flex_attention_0 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7840708Z triton_flex_attention_2 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.7841850Z triton_flex_attention_3 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.7842987Z triton_flex_attention_1 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7843243Z SingleProcess AUTOTUNE benchmarking takes 0.2600 seconds and 1.4244 seconds precompiling for 6 choices 2025-12-04T09:14:57.7843310Z Autotune Choices Stats: 2025-12-04T09:14:57.7844730Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:57.7845165Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7845536Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7846124Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7847284Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7848433Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7849601Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7850781Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7851961Z triton_flex_attention_backward_11 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7853171Z triton_flex_attention_backward_12 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7854362Z triton_flex_attention_backward_13 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7855541Z triton_flex_attention_backward_10 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7856689Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7857837Z triton_flex_attention_backward_15 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7858114Z SingleProcess AUTOTUNE benchmarking takes 0.8717 seconds and 2.2315 seconds precompiling for 13 choices 2025-12-04T09:14:57.7858247Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.7858321Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.7858387Z unimplemented [] 2025-12-04T09:14:57.7858493Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.7858677Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.7859871Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.7859971Z graph_break [] 2025-12-04T09:14:57.7860105Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.7860173Z Autotune Choices Stats: 2025-12-04T09:14:57.7861559Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.7861965Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7862194Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7862635Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7863756Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7864868Z triton_flex_attention_24 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7865974Z triton_flex_attention_19 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7867140Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.7868271Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.7869439Z triton_flex_attention_20 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7869687Z SingleProcess AUTOTUNE benchmarking takes 0.2898 seconds and 1.2778 seconds precompiling for 6 choices 2025-12-04T09:14:57.7869755Z Autotune Choices Stats: 2025-12-04T09:14:57.7871181Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.7871682Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7872012Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7872570Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7873731Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7874890Z triton_flex_attention_backward_27 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7876074Z triton_flex_attention_backward_28 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7877250Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7878407Z triton_flex_attention_backward_30 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7879560Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7880771Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7881921Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7883068Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7884260Z triton_flex_attention_backward_37 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.7884504Z SingleProcess AUTOTUNE benchmarking takes 0.6605 seconds and 2.3199 seconds precompiling for 13 choices 2025-12-04T09:14:57.7884640Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.7884708Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.7884771Z unimplemented [] 2025-12-04T09:14:57.7884879Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.7885063Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.7886276Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.7886340Z graph_break [] 2025-12-04T09:14:57.7886467Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.7886540Z Autotune Choices Stats: 2025-12-04T09:14:57.7887924Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.7888242Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7888461Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7888791Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7889908Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7891025Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7892142Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.7893344Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.7894490Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7895604Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7895851Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2616 seconds precompiling for 6 choices 2025-12-04T09:14:57.7895919Z Autotune Choices Stats: 2025-12-04T09:14:57.7897389Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.7897861Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7898190Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7898762Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7899920Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7901121Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7902280Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7903458Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7904604Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7905777Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7906965Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7908132Z triton_flex_attention_backward_50 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7909295Z triton_flex_attention_backward_53 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7910484Z triton_flex_attention_backward_56 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.7910725Z SingleProcess AUTOTUNE benchmarking takes 0.6631 seconds and 2.2635 seconds precompiling for 13 choices 2025-12-04T09:14:57.7910858Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.7910927Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.7910988Z unimplemented [] 2025-12-04T09:14:57.7911128Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.7911309Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.7912494Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.7912561Z graph_break [] 2025-12-04T09:14:57.7912689Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.7912761Z Autotune Choices Stats: 2025-12-04T09:14:57.7914175Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.7914454Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7914672Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7914999Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7916121Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7917234Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7918381Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.7919528Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.7920637Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7921749Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7921994Z SingleProcess AUTOTUNE benchmarking takes 0.2907 seconds and 1.2500 seconds precompiling for 6 choices 2025-12-04T09:14:57.7922126Z Autotune Choices Stats: 2025-12-04T09:14:57.7923539Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.7923979Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7924305Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7924868Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7926031Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7927225Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7928429Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7929578Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7930755Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7931927Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7933124Z triton_flex_attention_backward_69 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7934281Z triton_flex_attention_backward_70 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7935463Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7936650Z triton_flex_attention_backward_75 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.7936905Z SingleProcess AUTOTUNE benchmarking takes 0.6601 seconds and 2.1469 seconds precompiling for 13 choices 2025-12-04T09:14:57.7937045Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.7937115Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.7937177Z unimplemented [] 2025-12-04T09:14:57.7937298Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.7937489Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.7938675Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.7938738Z graph_break [] 2025-12-04T09:14:57.7938866Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.7939010Z Autotune Choices Stats: 2025-12-04T09:14:57.7940401Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.7940659Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7940880Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7941207Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7942328Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7943482Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7944599Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.7945750Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.7946860Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7948018Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7948338Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2576 seconds precompiling for 6 choices 2025-12-04T09:14:57.7948406Z Autotune Choices Stats: 2025-12-04T09:14:57.7949833Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_84", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.7950279Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7950603Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7951198Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7952356Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7953544Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7954697Z triton_flex_attention_backward_83 0.0122 ms 92.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7955841Z triton_flex_attention_backward_82 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7957063Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7958209Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7959356Z triton_flex_attention_backward_88 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7960554Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7961694Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7963100Z triton_flex_attention_backward_91 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7963352Z SingleProcess AUTOTUNE benchmarking takes 0.6599 seconds and 2.2799 seconds precompiling for 13 choices 2025-12-04T09:14:57.7963490Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.7963562Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.7963623Z unimplemented [] 2025-12-04T09:14:57.7963733Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.7963919Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.7965148Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.7965253Z graph_break [] 2025-12-04T09:14:57.7965381Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.7965464Z Autotune Choices Stats: 2025-12-04T09:14:57.7966860Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:57.7967112Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7967328Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7967647Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7968826Z triton_flex_attention_99 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7969950Z triton_flex_attention_100 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7971103Z triton_flex_attention_95 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7972246Z triton_flex_attention_98 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.7973445Z triton_flex_attention_97 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.7974607Z triton_flex_attention_96 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7974850Z SingleProcess AUTOTUNE benchmarking takes 0.2905 seconds and 1.2994 seconds precompiling for 6 choices 2025-12-04T09:14:57.7974927Z Autotune Choices Stats: 2025-12-04T09:14:57.7976354Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_102", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.7976793Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7977154Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7977718Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7978894Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7980087Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7981244Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7982427Z triton_flex_attention_backward_101 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7983615Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7984763Z triton_flex_attention_backward_107 0.0133 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.7985914Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7987101Z triton_flex_attention_backward_108 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.7988283Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.7989444Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7989688Z SingleProcess AUTOTUNE benchmarking takes 0.6615 seconds and 2.2434 seconds precompiling for 13 choices 2025-12-04T09:14:57.7989823Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.7989895Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.7989956Z unimplemented [] 2025-12-04T09:14:57.7990064Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.7990332Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.7991507Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.7991570Z graph_break [] 2025-12-04T09:14:57.7991697Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.7991771Z Autotune Choices Stats: 2025-12-04T09:14:57.7993162Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.7993411Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.7993664Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.7993996Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.7995124Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7996279Z triton_flex_attention_119 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.7997396Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.7998519Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.7999689Z triton_flex_attention_114 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8000803Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8001047Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2587 seconds precompiling for 6 choices 2025-12-04T09:14:57.8001120Z Autotune Choices Stats: 2025-12-04T09:14:57.8002548Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:57.8003037Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.8003362Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.8003923Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.8005121Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8006282Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8007471Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.8008655Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.8009816Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8010967Z triton_flex_attention_backward_124 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8012150Z triton_flex_attention_backward_126 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.8013375Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.8014535Z triton_flex_attention_backward_129 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8015694Z triton_flex_attention_backward_132 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.8016004Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.3330 seconds precompiling for 13 choices 2025-12-04T09:14:57.8016140Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.8016211Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.8016272Z unimplemented [] 2025-12-04T09:14:57.8016382Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.8016567Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.8017744Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.8017815Z graph_break [] 2025-12-04T09:14:57.8017943Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.8018015Z Autotune Choices Stats: 2025-12-04T09:14:57.8019404Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:57.8019690Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.8019911Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.8020234Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.8021389Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8022509Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8023628Z triton_flex_attention_136 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.8024808Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.8025950Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8027077Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8027323Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3488 seconds precompiling for 6 choices 2025-12-04T09:14:57.8027429Z Autotune Choices Stats: 2025-12-04T09:14:57.8028848Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_140", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.8029287Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.8029609Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.8030213Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.8031384Z triton_flex_attention_backward_140 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8032579Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.8033760Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8034917Z triton_flex_attention_backward_142 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.8036075Z triton_flex_attention_backward_144 0.0131 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8037260Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8038413Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.8039597Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.8040751Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8041946Z triton_flex_attention_backward_149 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.8042221Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.2392 seconds precompiling for 13 choices 2025-12-04T09:14:57.8042358Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.8042429Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.8042491Z unimplemented [] 2025-12-04T09:14:57.8042602Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.8042787Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.8043964Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.8044031Z graph_break [] 2025-12-04T09:14:57.8044157Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.8044268Z Autotune Choices Stats: 2025-12-04T09:14:57.8045663Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.8045911Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.8046129Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.8046449Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.8047615Z triton_flex_attention_156 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8048729Z triton_flex_attention_157 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8049874Z triton_flex_attention_152 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8051025Z triton_flex_attention_154 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.8052142Z triton_flex_attention_155 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.8053318Z triton_flex_attention_153 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8053597Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3309 seconds precompiling for 6 choices 2025-12-04T09:14:57.8053670Z Autotune Choices Stats: 2025-12-04T09:14:57.8055093Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_160", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.8055585Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.8055911Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.8056477Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.8057642Z triton_flex_attention_backward_160 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.8058829Z triton_flex_attention_backward_158 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8060012Z triton_flex_attention_backward_159 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8061172Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.8062493Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8063722Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8064944Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.8066095Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.8067297Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8068497Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.8068746Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.1546 seconds precompiling for 13 choices 2025-12-04T09:14:57.8068881Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.8068960Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.8069026Z unimplemented [] 2025-12-04T09:14:57.8069139Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.8069327Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.8070501Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.8070605Z graph_break [] 2025-12-04T09:14:57.8070735Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.8070816Z Autotune Choices Stats: 2025-12-04T09:14:57.8072208Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.8072466Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.8072721Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.8073043Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.8074185Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8075335Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8076479Z triton_flex_attention_171 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8077605Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.8078715Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.8079876Z triton_flex_attention_172 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8080120Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3199 seconds precompiling for 6 choices 2025-12-04T09:14:57.8080194Z Autotune Choices Stats: 2025-12-04T09:14:57.8081654Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_178", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.8082100Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.8082424Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.8082993Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.8084521Z triton_flex_attention_backward_178 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8085720Z triton_flex_attention_backward_177 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8086887Z triton_flex_attention_backward_180 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.8088045Z triton_flex_attention_backward_179 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.8089271Z triton_flex_attention_backward_182 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8090457Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8091611Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.8092857Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.8094084Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8095241Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.8095489Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3449 seconds precompiling for 13 choices 2025-12-04T09:14:57.8095617Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.8095693Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.8095756Z unimplemented [] 2025-12-04T09:14:57.8095862Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.8096047Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.8097260Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.8097333Z graph_break [] 2025-12-04T09:14:57.8097459Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.8097531Z Autotune Choices Stats: 2025-12-04T09:14:57.8098958Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:57.8099212Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.8099426Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.8099740Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.8100867Z triton_flex_attention_194 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8102053Z triton_flex_attention_195 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8103165Z triton_flex_attention_192 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.8104292Z triton_flex_attention_193 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.8105400Z triton_flex_attention_190 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8106550Z triton_flex_attention_191 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8106791Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3094 seconds precompiling for 6 choices 2025-12-04T09:14:57.8106881Z Autotune Choices Stats: 2025-12-04T09:14:57.8108355Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.8108797Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.8109125Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.8109722Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.8110911Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8112072Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.8113228Z triton_flex_attention_backward_197 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8114410Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.8115568Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8116746Z triton_flex_attention_backward_200 0.0124 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8117905Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.8119084Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.8120292Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.8121444Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8121688Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.3930 seconds precompiling for 13 choices 2025-12-04T09:14:57.8121852Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.8121929Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.8121991Z unimplemented [] 2025-12-04T09:14:57.8122096Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.8122287Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.8123460Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.8123531Z graph_break [] 2025-12-04T09:14:57.8123659Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.8123733Z Autotune Choices Stats: 2025-12-04T09:14:57.8125163Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008063999935984612, "best_triton_pos": 0} 2025-12-04T09:14:57.8125414Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.8125635Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.8125951Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.8127122Z triton_flex_attention_213 0.0081 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8128274Z triton_flex_attention_214 0.0082 ms 98.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8129393Z triton_flex_attention_211 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.8130519Z triton_flex_attention_212 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.8131665Z triton_flex_attention_209 0.0112 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8132874Z triton_flex_attention_210 0.0133 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8133119Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.3148 seconds precompiling for 6 choices 2025-12-04T09:14:57.8133197Z Autotune Choices Stats: 2025-12-04T09:14:57.8134622Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.8135070Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.8135458Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.8136022Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.8137184Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8138356Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8139558Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.8140723Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.8141909Z triton_flex_attention_backward_219 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8143053Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8144240Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.8145414Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.8146573Z triton_flex_attention_backward_224 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8147738Z triton_flex_attention_backward_227 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.8148022Z SingleProcess AUTOTUNE benchmarking takes 0.6619 seconds and 2.3746 seconds precompiling for 13 choices 2025-12-04T09:14:57.8148155Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.8148229Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.8148291Z unimplemented [] 2025-12-04T09:14:57.8148391Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.8148576Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.8149793Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.8149864Z graph_break [] 2025-12-04T09:14:57.8149991Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.8150070Z Autotune Choices Stats: 2025-12-04T09:14:57.8151458Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.8151705Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.8151921Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.8152317Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.8153461Z triton_flex_attention_232 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8154573Z triton_flex_attention_233 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8155690Z triton_flex_attention_228 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8156846Z triton_flex_attention_230 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.8157959Z triton_flex_attention_231 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.8159121Z triton_flex_attention_229 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8159365Z SingleProcess AUTOTUNE benchmarking takes 0.2900 seconds and 1.2906 seconds precompiling for 6 choices 2025-12-04T09:14:57.8159435Z Autotune Choices Stats: 2025-12-04T09:14:57.8160889Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:57.8161362Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.8161683Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.8162397Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.8163558Z triton_flex_attention_backward_234 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8164720Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8165936Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.8167137Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.8168293Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8169491Z triton_flex_attention_backward_240 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.8170686Z triton_flex_attention_backward_241 0.0134 ms 91.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.8171839Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8173059Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8174246Z triton_flex_attention_backward_246 0.0153 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.8174510Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2768 seconds precompiling for 13 choices 2025-12-04T09:14:57.8174638Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.8174711Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.8174774Z unimplemented [] 2025-12-04T09:14:57.8174877Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.8175066Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.8176275Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.8176346Z graph_break [] 2025-12-04T09:14:57.8176471Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.8176542Z Autotune Choices Stats: 2025-12-04T09:14:57.8177962Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.8178244Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.8178463Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.8178783Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.8179913Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8181026Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8182178Z triton_flex_attention_250 0.0111 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.8183289Z triton_flex_attention_247 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8184452Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.8185566Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8185808Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2952 seconds precompiling for 6 choices 2025-12-04T09:14:57.8185881Z Autotune Choices Stats: 2025-12-04T09:14:57.8187338Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_255", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.8187803Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.8188130Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.8188702Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.8189859Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.8191058Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8192244Z triton_flex_attention_backward_254 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8193412Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.8194567Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8195753Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8196936Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.8198086Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.8199285Z triton_flex_attention_backward_262 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8200433Z triton_flex_attention_backward_265 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.8200683Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.3139 seconds precompiling for 13 choices 2025-12-04T09:14:57.8200844Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.8200920Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.8200980Z unimplemented [] 2025-12-04T09:14:57.8201085Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.8201277Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.8202455Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.8202526Z graph_break [] 2025-12-04T09:14:57.8202655Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.8202723Z Autotune Choices Stats: 2025-12-04T09:14:57.8204150Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.8204423Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.8204648Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.8204964Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.8206104Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8207214Z triton_flex_attention_271 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8208365Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8209515Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.8210629Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.8211745Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8212057Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T09:14:57.8212130Z Autotune Choices Stats: 2025-12-04T09:14:57.8213614Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:57.8214053Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.8214378Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.8214941Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.8216157Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8217310Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8218537Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.8219692Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.8220873Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8222061Z triton_flex_attention_backward_278 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.8223215Z triton_flex_attention_backward_279 0.0136 ms 90.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.8224369Z triton_flex_attention_backward_276 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8225549Z triton_flex_attention_backward_281 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8226735Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.8226988Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.2940 seconds precompiling for 13 choices 2025-12-04T09:14:57.8227115Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.8227195Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.8227262Z unimplemented [] 2025-12-04T09:14:57.8227365Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.8227554Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.8228739Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.8228868Z graph_break [] 2025-12-04T09:14:57.8228994Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.8229060Z Autotune Choices Stats: 2025-12-04T09:14:57.8230450Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.8230689Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.8230913Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.8231226Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.8232350Z triton_flex_attention_289 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8233497Z triton_flex_attention_290 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8234650Z triton_flex_attention_287 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.8235766Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8236883Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.8238033Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8238318Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3003 seconds precompiling for 6 choices 2025-12-04T09:14:57.8238391Z Autotune Choices Stats: 2025-12-04T09:14:57.8239820Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_294", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.012160000391304493, "best_triton_pos": 0} 2025-12-04T09:14:57.8240258Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.8240582Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.8241174Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.8242339Z triton_flex_attention_backward_294 0.0122 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.8243535Z triton_flex_attention_backward_291 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8244681Z triton_flex_attention_backward_292 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8245872Z triton_flex_attention_backward_293 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.8247070Z triton_flex_attention_backward_296 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8248223Z triton_flex_attention_backward_295 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8249388Z triton_flex_attention_backward_297 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.8250581Z triton_flex_attention_backward_298 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.8251777Z triton_flex_attention_backward_300 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8252974Z triton_flex_attention_backward_303 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.8253228Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.2523 seconds precompiling for 13 choices 2025-12-04T09:14:57.8253357Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.8253434Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.8253495Z unimplemented [] 2025-12-04T09:14:57.8253595Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.8253786Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.8254993Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.8255089Z graph_break [] 2025-12-04T09:14:57.8255217Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.8255285Z Autotune Choices Stats: 2025-12-04T09:14:57.8256682Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_308", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.8256924Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.8257148Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.8257494Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.8258622Z triton_flex_attention_308 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8259741Z triton_flex_attention_309 0.0082 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8260896Z triton_flex_attention_307 0.0112 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.8262159Z triton_flex_attention_304 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8263341Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.8264516Z triton_flex_attention_305 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8264760Z SingleProcess AUTOTUNE benchmarking takes 0.2924 seconds and 1.4096 seconds precompiling for 6 choices 2025-12-04T09:14:57.8264835Z Autotune Choices Stats: 2025-12-04T09:14:57.8266261Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.8266750Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.8267076Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.8267640Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.8268859Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8270023Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8271174Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.8272398Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.8273567Z triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8274720Z triton_flex_attention_backward_316 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.8275910Z triton_flex_attention_backward_317 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.8277062Z triton_flex_attention_backward_314 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8278250Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8279405Z triton_flex_attention_backward_318 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8279657Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.3225 seconds precompiling for 13 choices 2025-12-04T09:14:57.8279784Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.8279921Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.8280016Z unimplemented [] 2025-12-04T09:14:57.8280130Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.8280323Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.8281500Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.8281571Z graph_break [] 2025-12-04T09:14:57.8281697Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.8281763Z Autotune Choices Stats: 2025-12-04T09:14:57.8283152Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.007968000136315823, "best_triton_pos": 0} 2025-12-04T09:14:57.8283426Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.8283654Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.8283971Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.8285100Z triton_flex_attention_327 0.0080 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8286241Z triton_flex_attention_328 0.0082 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8287364Z triton_flex_attention_325 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.8288515Z triton_flex_attention_326 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.8289676Z triton_flex_attention_323 0.0113 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8290796Z triton_flex_attention_324 0.0133 ms 59.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8291039Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T09:14:57.8291129Z Autotune Choices Stats: 2025-12-04T09:14:57.8292595Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.8293076Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.8293400Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.8293959Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.8295161Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8296326Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8297513Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.8298702Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.8299851Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8301005Z triton_flex_attention_backward_336 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.8302201Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8303392Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.8304552Z triton_flex_attention_backward_338 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8305737Z triton_flex_attention_backward_341 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.8306019Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3668 seconds precompiling for 13 choices 2025-12-04T09:14:57.8306146Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.8306223Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.8306289Z unimplemented [] 2025-12-04T09:14:57.8306389Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.8306578Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.8307754Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.8307821Z graph_break [] 2025-12-04T09:14:57.8307951Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.8308019Z Autotune Choices Stats: 2025-12-04T09:14:57.8309414Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.8309694Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.8309917Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.8310233Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.8311417Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8312543Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8313673Z triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8314847Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.8315969Z triton_flex_attention_345 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.8317087Z triton_flex_attention_343 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8317358Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2820 seconds precompiling for 6 choices 2025-12-04T09:14:57.8317426Z Autotune Choices Stats: 2025-12-04T09:14:57.8318866Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_349", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.8319303Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.8319658Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.8320218Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.8321391Z triton_flex_attention_backward_349 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8322582Z triton_flex_attention_backward_348 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8323759Z triton_flex_attention_backward_350 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.8324916Z triton_flex_attention_backward_351 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.8326069Z triton_flex_attention_backward_353 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8327261Z triton_flex_attention_backward_355 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.8328450Z triton_flex_attention_backward_352 0.0143 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8329608Z triton_flex_attention_backward_354 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.8330766Z triton_flex_attention_backward_357 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8331976Z triton_flex_attention_backward_360 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.8332228Z SingleProcess AUTOTUNE benchmarking takes 0.6637 seconds and 2.2959 seconds precompiling for 13 choices 2025-12-04T09:14:57.8332396Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.8332475Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.8332539Z unimplemented [] 2025-12-04T09:14:57.8332654Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.8332850Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.8334024Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.8334132Z graph_break [] 2025-12-04T09:14:57.8334259Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.8334326Z Autotune Choices Stats: 2025-12-04T09:14:57.8335729Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.8335971Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.8336194Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.8336543Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.8337677Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8338794Z triton_flex_attention_366 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8339954Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.8341100Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8342220Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.8343345Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8343641Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3270 seconds precompiling for 6 choices 2025-12-04T09:14:57.8343712Z Autotune Choices Stats: 2025-12-04T09:14:57.8345138Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_368", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.8345601Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.8345937Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.8346492Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.8347694Z triton_flex_attention_backward_368 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8348878Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8350033Z triton_flex_attention_backward_369 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.8351184Z triton_flex_attention_backward_370 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.8352406Z triton_flex_attention_backward_372 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8353556Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.8354737Z triton_flex_attention_backward_374 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.8355889Z triton_flex_attention_backward_371 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8357074Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8358252Z triton_flex_attention_backward_379 0.0146 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.8358500Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.4637 seconds precompiling for 13 choices 2025-12-04T09:14:57.8358683Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:14:57.8358767Z Traceback (most recent call last): 2025-12-04T09:14:57.8359080Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:14:57.8359147Z self.assertTrue( 2025-12-04T09:14:57.8359350Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:14:57.8359436Z raise self.failureException(msg) 2025-12-04T09:14:57.8359709Z AssertionError: False is not true : Log file /tmp/tmpa8611ro_/flex_attention_configs.json was not created 2025-12-04T09:14:57.8359713Z 2025-12-04T09:14:57.8359852Z To execute this test, run the following from the base repo dir: 2025-12-04T09:14:57.8360106Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:14:57.8360110Z 2025-12-04T09:14:57.8360281Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:14:57.8360409Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.8360479Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.8360548Z unimplemented [] 2025-12-04T09:14:57.8360651Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.8362076Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('async_compile_cache_miss', 23), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:14:57.8362286Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.8362351Z graph_break [] 2025-12-04T09:14:57.8362495Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.8363506Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:14:57.8363598Z current_size = base.storage().size() 2025-12-04T09:14:57.8363669Z Autotune Choices Stats: 2025-12-04T09:14:57.8365124Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:57.8365415Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.8365635Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.8365963Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.8367094Z triton_flex_attention_4 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8368205Z triton_flex_attention_5 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8369369Z triton_flex_attention_0 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8370507Z triton_flex_attention_2 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.8371623Z triton_flex_attention_3 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.8372769Z triton_flex_attention_1 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8373037Z SingleProcess AUTOTUNE benchmarking takes 0.2600 seconds and 1.4244 seconds precompiling for 6 choices 2025-12-04T09:14:57.8373145Z Autotune Choices Stats: 2025-12-04T09:14:57.8374645Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:57.8375087Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.8375419Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.8375982Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.8377150Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8381873Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.8383176Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8384376Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.8385597Z triton_flex_attention_backward_11 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8386812Z triton_flex_attention_backward_12 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.8387974Z triton_flex_attention_backward_13 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.8389144Z triton_flex_attention_backward_10 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8390331Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8391488Z triton_flex_attention_backward_15 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8391781Z SingleProcess AUTOTUNE benchmarking takes 0.8717 seconds and 2.2315 seconds precompiling for 13 choices 2025-12-04T09:14:57.8391933Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.8392014Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.8392089Z unimplemented [] 2025-12-04T09:14:57.8392200Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.8392394Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.8393594Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.8393658Z graph_break [] 2025-12-04T09:14:57.8393799Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.8393908Z Autotune Choices Stats: 2025-12-04T09:14:57.8395338Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.8395597Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.8395831Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.8396162Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.8397293Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8398442Z triton_flex_attention_24 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8399545Z triton_flex_attention_19 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8400692Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.8401807Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.8402952Z triton_flex_attention_20 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8403237Z SingleProcess AUTOTUNE benchmarking takes 0.2898 seconds and 1.2778 seconds precompiling for 6 choices 2025-12-04T09:14:57.8403305Z Autotune Choices Stats: 2025-12-04T09:14:57.8404736Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.8405174Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.8405502Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.8406099Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.8407271Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8408460Z triton_flex_attention_backward_27 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.8409615Z triton_flex_attention_backward_28 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.8410762Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8411958Z triton_flex_attention_backward_30 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8413218Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8414366Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.8415548Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.8416693Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8417871Z triton_flex_attention_backward_37 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.8418119Z SingleProcess AUTOTUNE benchmarking takes 0.6605 seconds and 2.3199 seconds precompiling for 13 choices 2025-12-04T09:14:57.8418257Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.8418327Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.8418397Z unimplemented [] 2025-12-04T09:14:57.8418500Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.8418689Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.8420058Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.8420153Z graph_break [] 2025-12-04T09:14:57.8420293Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.8420362Z Autotune Choices Stats: 2025-12-04T09:14:57.8421748Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.8422005Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.8422225Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.8422551Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.8423677Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8424827Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8425971Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.8427089Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.8428206Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8429378Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8429633Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2616 seconds precompiling for 6 choices 2025-12-04T09:14:57.8429703Z Autotune Choices Stats: 2025-12-04T09:14:57.8431137Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.8431571Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.8431942Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.8432503Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.8433670Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8434860Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.8436016Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.8437190Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8438372Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8439532Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8440676Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.8441861Z triton_flex_attention_backward_50 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.8443065Z triton_flex_attention_backward_53 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8444231Z triton_flex_attention_backward_56 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.8444480Z SingleProcess AUTOTUNE benchmarking takes 0.6631 seconds and 2.2635 seconds precompiling for 13 choices 2025-12-04T09:14:57.8444619Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.8444689Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.8444753Z unimplemented [] 2025-12-04T09:14:57.8444867Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.8445058Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.8446314Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.8446379Z graph_break [] 2025-12-04T09:14:57.8446513Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.8446581Z Autotune Choices Stats: 2025-12-04T09:14:57.8447966Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.8448215Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.8448434Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.8448787Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.8449917Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8451068Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8452182Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.8453355Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.8454504Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8455645Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8455899Z SingleProcess AUTOTUNE benchmarking takes 0.2907 seconds and 1.2500 seconds precompiling for 6 choices 2025-12-04T09:14:57.8455969Z Autotune Choices Stats: 2025-12-04T09:14:57.8457395Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.8457861Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.8458197Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.8458759Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.8459956Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8461113Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8462511Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.8463730Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.8464891Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8466036Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8467242Z triton_flex_attention_backward_69 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.8468400Z triton_flex_attention_backward_70 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.8469605Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8470755Z triton_flex_attention_backward_75 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.8471002Z SingleProcess AUTOTUNE benchmarking takes 0.6601 seconds and 2.1469 seconds precompiling for 13 choices 2025-12-04T09:14:57.8471203Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.8471274Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.8471336Z unimplemented [] 2025-12-04T09:14:57.8471456Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.8471640Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.8472823Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.8472885Z graph_break [] 2025-12-04T09:14:57.8473012Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.8473088Z Autotune Choices Stats: 2025-12-04T09:14:57.8474468Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.8474774Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.8474992Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.8475318Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.8476444Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8477607Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8478719Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.8479874Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.8481014Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8482129Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8482380Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2576 seconds precompiling for 6 choices 2025-12-04T09:14:57.8482446Z Autotune Choices Stats: 2025-12-04T09:14:57.8483862Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_84", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.8484337Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.8484658Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.8485257Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.8486418Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.8487576Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.8488836Z triton_flex_attention_backward_83 0.0122 ms 92.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8489979Z triton_flex_attention_backward_82 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8491131Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8492392Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8493551Z triton_flex_attention_backward_88 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.8494741Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.8495891Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8497080Z triton_flex_attention_backward_91 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8497358Z SingleProcess AUTOTUNE benchmarking takes 0.6599 seconds and 2.2799 seconds precompiling for 13 choices 2025-12-04T09:14:57.8497493Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.8497561Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.8497624Z unimplemented [] 2025-12-04T09:14:57.8497735Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.8497917Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.8499117Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.8499179Z graph_break [] 2025-12-04T09:14:57.8499305Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.8499379Z Autotune Choices Stats: 2025-12-04T09:14:57.8500769Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:57.8501064Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.8501282Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.8501603Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.8502758Z triton_flex_attention_99 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8503880Z triton_flex_attention_100 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8505016Z triton_flex_attention_95 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8506178Z triton_flex_attention_98 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.8507295Z triton_flex_attention_97 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.8508411Z triton_flex_attention_96 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8508702Z SingleProcess AUTOTUNE benchmarking takes 0.2905 seconds and 1.2994 seconds precompiling for 6 choices 2025-12-04T09:14:57.8508769Z Autotune Choices Stats: 2025-12-04T09:14:57.8510204Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_102", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.8510640Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.8510998Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.8511566Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.8512730Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8513922Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.8515113Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.8516257Z triton_flex_attention_backward_101 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8517407Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8518583Z triton_flex_attention_backward_107 0.0133 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.8519771Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8520926Z triton_flex_attention_backward_108 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.8522071Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8523299Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8523547Z SingleProcess AUTOTUNE benchmarking takes 0.6615 seconds and 2.2434 seconds precompiling for 13 choices 2025-12-04T09:14:57.8523690Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.8523758Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.8523820Z unimplemented [] 2025-12-04T09:14:57.8523926Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.8524111Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.8525287Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.8525395Z graph_break [] 2025-12-04T09:14:57.8525524Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.8525596Z Autotune Choices Stats: 2025-12-04T09:14:57.8526984Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.8527232Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.8527447Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.8527811Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.8528945Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8530061Z triton_flex_attention_119 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8531405Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.8532591Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.8533712Z triton_flex_attention_114 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8534833Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8535119Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2587 seconds precompiling for 6 choices 2025-12-04T09:14:57.8535186Z Autotune Choices Stats: 2025-12-04T09:14:57.8536647Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:57.8537091Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.8537421Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.8537987Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.8539216Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8540407Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8541565Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.8542712Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.8543900Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8545077Z triton_flex_attention_backward_124 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8546233Z triton_flex_attention_backward_126 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.8547391Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.8548561Z triton_flex_attention_backward_129 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8549767Z triton_flex_attention_backward_132 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.8550017Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.3330 seconds precompiling for 13 choices 2025-12-04T09:14:57.8550152Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.8550220Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.8550282Z unimplemented [] 2025-12-04T09:14:57.8550389Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.8550570Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.8551750Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.8551852Z graph_break [] 2025-12-04T09:14:57.8551981Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.8552057Z Autotune Choices Stats: 2025-12-04T09:14:57.8553483Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:57.8553736Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.8553957Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.8554281Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.8555402Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8556549Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8557687Z triton_flex_attention_136 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.8558817Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.8559928Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8561081Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8561334Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3488 seconds precompiling for 6 choices 2025-12-04T09:14:57.8561402Z Autotune Choices Stats: 2025-12-04T09:14:57.8563053Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_140", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.8563500Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.8563826Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.8564393Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.8565644Z triton_flex_attention_backward_140 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8566820Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.8567978Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8569171Z triton_flex_attention_backward_142 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.8570324Z triton_flex_attention_backward_144 0.0131 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8572905Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8575291Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.8577703Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.8580117Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8582497Z triton_flex_attention_backward_149 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.8583967Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.2392 seconds precompiling for 13 choices 2025-12-04T09:14:57.8584429Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.8584746Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.8584958Z unimplemented [] 2025-12-04T09:14:57.8585168Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.8585533Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.8586985Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.8588295Z graph_break [] 2025-12-04T09:14:57.8588529Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.8588824Z Autotune Choices Stats: 2025-12-04T09:14:57.8590373Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.8592088Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.8592636Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.8593250Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.8594825Z triton_flex_attention_156 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8597165Z triton_flex_attention_157 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8599496Z triton_flex_attention_152 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8601810Z triton_flex_attention_154 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.8604165Z triton_flex_attention_155 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.8606467Z triton_flex_attention_153 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8607933Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3309 seconds precompiling for 6 choices 2025-12-04T09:14:57.8608324Z Autotune Choices Stats: 2025-12-04T09:14:57.8609883Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_160", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.8611805Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.8612729Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.8613728Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.8615538Z triton_flex_attention_backward_160 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.8617929Z triton_flex_attention_backward_158 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8620308Z triton_flex_attention_backward_159 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8622726Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.8625149Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8627529Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8629942Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.8632367Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.8634739Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8637114Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.8638642Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.1546 seconds precompiling for 13 choices 2025-12-04T09:14:57.8639093Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.8639386Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.8639579Z unimplemented [] 2025-12-04T09:14:57.8639787Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.8640155Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.8641606Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.8642943Z graph_break [] 2025-12-04T09:14:57.8643168Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.8643443Z Autotune Choices Stats: 2025-12-04T09:14:57.8644950Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.8646642Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.8647188Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.8647840Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.8649404Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8651728Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8654063Z triton_flex_attention_171 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8656416Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.8658736Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.8661078Z triton_flex_attention_172 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8662695Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3199 seconds precompiling for 6 choices 2025-12-04T09:14:57.8663086Z Autotune Choices Stats: 2025-12-04T09:14:57.8664632Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_178", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.8666675Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.8667510Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.8668463Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.8670271Z triton_flex_attention_backward_178 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8672656Z triton_flex_attention_backward_177 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8675087Z triton_flex_attention_backward_180 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.8677496Z triton_flex_attention_backward_179 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.8679892Z triton_flex_attention_backward_182 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8682260Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8684738Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.8687118Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.8689489Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8691894Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.8693400Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3449 seconds precompiling for 13 choices 2025-12-04T09:14:57.8693849Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.8694131Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.8694337Z unimplemented [] 2025-12-04T09:14:57.8694542Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.8694905Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.8696385Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.8697693Z graph_break [] 2025-12-04T09:14:57.8697914Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.8698186Z Autotune Choices Stats: 2025-12-04T09:14:57.8699692Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:57.8701482Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.8702023Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.8702634Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.8704156Z triton_flex_attention_194 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8706471Z triton_flex_attention_195 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8708799Z triton_flex_attention_192 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.8711139Z triton_flex_attention_193 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.8713471Z triton_flex_attention_190 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8715774Z triton_flex_attention_191 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8717207Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3094 seconds precompiling for 6 choices 2025-12-04T09:14:57.8717586Z Autotune Choices Stats: 2025-12-04T09:14:57.8719175Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.8721139Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.8721974Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.8722938Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.8724741Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8727157Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.8729538Z triton_flex_attention_backward_197 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8731948Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.8734386Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8736796Z triton_flex_attention_backward_200 0.0124 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8739210Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.8741578Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.8743947Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.8746350Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8747819Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.3930 seconds precompiling for 13 choices 2025-12-04T09:14:57.8748272Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.8748545Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.8748794Z unimplemented [] 2025-12-04T09:14:57.8749004Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.8749367Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.8750801Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.8752105Z graph_break [] 2025-12-04T09:14:57.8752328Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.8752603Z Autotune Choices Stats: 2025-12-04T09:14:57.8754139Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008063999935984612, "best_triton_pos": 0} 2025-12-04T09:14:57.8755864Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.8756400Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.8757011Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.8758528Z triton_flex_attention_213 0.0081 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8760827Z triton_flex_attention_214 0.0082 ms 98.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8763327Z triton_flex_attention_211 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.8765711Z triton_flex_attention_212 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.8768033Z triton_flex_attention_209 0.0112 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8770346Z triton_flex_attention_210 0.0133 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8771781Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.3148 seconds precompiling for 6 choices 2025-12-04T09:14:57.8772332Z Autotune Choices Stats: 2025-12-04T09:14:57.8773880Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.8775808Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.8776649Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.8777615Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.8779482Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8781891Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8784324Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.8786711Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.8789135Z triton_flex_attention_backward_219 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8791540Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8793926Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.8796315Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.8798738Z triton_flex_attention_backward_224 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8801154Z triton_flex_attention_backward_227 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.8802635Z SingleProcess AUTOTUNE benchmarking takes 0.6619 seconds and 2.3746 seconds precompiling for 13 choices 2025-12-04T09:14:57.8803104Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.8803393Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.8803586Z unimplemented [] 2025-12-04T09:14:57.8803797Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.8804165Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.8805611Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.8806914Z graph_break [] 2025-12-04T09:14:57.8807211Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.8807504Z Autotune Choices Stats: 2025-12-04T09:14:57.8809019Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.8810722Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.8811265Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.8811874Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.8813462Z triton_flex_attention_232 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8815829Z triton_flex_attention_233 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8818128Z triton_flex_attention_228 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8820454Z triton_flex_attention_230 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.8822755Z triton_flex_attention_231 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.8825083Z triton_flex_attention_229 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8826549Z SingleProcess AUTOTUNE benchmarking takes 0.2900 seconds and 1.2906 seconds precompiling for 6 choices 2025-12-04T09:14:57.8826936Z Autotune Choices Stats: 2025-12-04T09:14:57.8828472Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:57.8830389Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.8831220Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.8832217Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.8834037Z triton_flex_attention_backward_234 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8836488Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8838892Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.8841283Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.8843735Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8846128Z triton_flex_attention_backward_240 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.8848514Z triton_flex_attention_backward_241 0.0134 ms 91.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.8850935Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8853376Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8855791Z triton_flex_attention_backward_246 0.0153 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.8857282Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2768 seconds precompiling for 13 choices 2025-12-04T09:14:57.8857757Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.8858041Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.8858234Z unimplemented [] 2025-12-04T09:14:57.8858441Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.8858801Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.8860284Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.8861613Z graph_break [] 2025-12-04T09:14:57.8861970Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.8862263Z Autotune Choices Stats: 2025-12-04T09:14:57.8863762Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.8865460Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.8865992Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.8866605Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.8868202Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8870515Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8872871Z triton_flex_attention_250 0.0111 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.8875176Z triton_flex_attention_247 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8877539Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.8879893Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8881331Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2952 seconds precompiling for 6 choices 2025-12-04T09:14:57.8881721Z Autotune Choices Stats: 2025-12-04T09:14:57.8883256Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_255", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.8885218Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.8886049Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.8887014Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.8888853Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.8891247Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8893688Z triton_flex_attention_backward_254 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8896097Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.8898508Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8900879Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8903257Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.8905684Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.8908107Z triton_flex_attention_backward_262 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8910489Z triton_flex_attention_backward_265 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.8911957Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.3139 seconds precompiling for 13 choices 2025-12-04T09:14:57.8912414Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.8912699Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.8912901Z unimplemented [] 2025-12-04T09:14:57.8913137Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.8913531Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.8914974Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.8916272Z graph_break [] 2025-12-04T09:14:57.8916497Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.8916777Z Autotune Choices Stats: 2025-12-04T09:14:57.8918277Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.8919968Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.8920547Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.8921152Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.8922679Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8925033Z triton_flex_attention_271 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8927341Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8929649Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.8932031Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.8934382Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8935815Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T09:14:57.8936199Z Autotune Choices Stats: 2025-12-04T09:14:57.8937745Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:57.8939715Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.8940546Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.8941502Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.8943347Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8945740Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8948158Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.8950582Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.8952965Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8955343Z triton_flex_attention_backward_278 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.8957754Z triton_flex_attention_backward_279 0.0136 ms 90.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.8960205Z triton_flex_attention_backward_276 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.8962732Z triton_flex_attention_backward_281 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8965120Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.8966726Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.2940 seconds precompiling for 13 choices 2025-12-04T09:14:57.8967194Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.8967471Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.8967666Z unimplemented [] 2025-12-04T09:14:57.8967870Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.8968230Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.8969686Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.8970991Z graph_break [] 2025-12-04T09:14:57.8971220Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.8971493Z Autotune Choices Stats: 2025-12-04T09:14:57.8973033Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.8974797Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.8975341Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.8975953Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.8977526Z triton_flex_attention_289 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8979841Z triton_flex_attention_290 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8982149Z triton_flex_attention_287 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.8984492Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8986832Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.8989143Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.8990574Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3003 seconds precompiling for 6 choices 2025-12-04T09:14:57.8990991Z Autotune Choices Stats: 2025-12-04T09:14:57.8992525Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_294", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.012160000391304493, "best_triton_pos": 0} 2025-12-04T09:14:57.8994446Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.8995291Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.8996287Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.8998094Z triton_flex_attention_backward_294 0.0122 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9000518Z triton_flex_attention_backward_291 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9002925Z triton_flex_attention_backward_292 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9005305Z triton_flex_attention_backward_293 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9007683Z triton_flex_attention_backward_296 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9010092Z triton_flex_attention_backward_295 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9012504Z triton_flex_attention_backward_297 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9014920Z triton_flex_attention_backward_298 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9017298Z triton_flex_attention_backward_300 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9019719Z triton_flex_attention_backward_303 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.9021217Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.2523 seconds precompiling for 13 choices 2025-12-04T09:14:57.9021668Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.9021964Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.9022150Z unimplemented [] 2025-12-04T09:14:57.9022358Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.9022721Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.9024171Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.9025463Z graph_break [] 2025-12-04T09:14:57.9025690Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.9026010Z Autotune Choices Stats: 2025-12-04T09:14:57.9027515Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_308", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.9029214Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9029745Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9030357Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9031923Z triton_flex_attention_308 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9034237Z triton_flex_attention_309 0.0082 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9036586Z triton_flex_attention_307 0.0112 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.9038920Z triton_flex_attention_304 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9041242Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.9043558Z triton_flex_attention_305 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9045034Z SingleProcess AUTOTUNE benchmarking takes 0.2924 seconds and 1.4096 seconds precompiling for 6 choices 2025-12-04T09:14:57.9045428Z Autotune Choices Stats: 2025-12-04T09:14:57.9046966Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.9048925Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9049754Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9050726Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9052581Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9055048Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9057437Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9059817Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9062356Z triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9064742Z triton_flex_attention_backward_316 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9067172Z triton_flex_attention_backward_317 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9069591Z triton_flex_attention_backward_314 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9072024Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9074452Z triton_flex_attention_backward_318 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9075929Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.3225 seconds precompiling for 13 choices 2025-12-04T09:14:57.9076376Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.9076657Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.9076851Z unimplemented [] 2025-12-04T09:14:57.9077070Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.9077434Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.9078884Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.9080245Z graph_break [] 2025-12-04T09:14:57.9080465Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.9080749Z Autotune Choices Stats: 2025-12-04T09:14:57.9082254Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.007968000136315823, "best_triton_pos": 0} 2025-12-04T09:14:57.9083940Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9084513Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9085126Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9086646Z triton_flex_attention_327 0.0080 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9088993Z triton_flex_attention_328 0.0082 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9091321Z triton_flex_attention_325 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.9093670Z triton_flex_attention_326 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.9095966Z triton_flex_attention_323 0.0113 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9098305Z triton_flex_attention_324 0.0133 ms 59.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9099743Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T09:14:57.9100131Z Autotune Choices Stats: 2025-12-04T09:14:57.9101722Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.9103654Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9104487Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9105473Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9107315Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9109751Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9112146Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9114539Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9116958Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9119384Z triton_flex_attention_backward_336 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9121764Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9124168Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9126584Z triton_flex_attention_backward_338 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9128981Z triton_flex_attention_backward_341 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.9130449Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3668 seconds precompiling for 13 choices 2025-12-04T09:14:57.9130901Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.9131184Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.9131379Z unimplemented [] 2025-12-04T09:14:57.9131581Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.9131971Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.9133453Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.9134771Z graph_break [] 2025-12-04T09:14:57.9135000Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.9135275Z Autotune Choices Stats: 2025-12-04T09:14:57.9136798Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.9138498Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9139047Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9139655Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9141193Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9143567Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9145882Z triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9148191Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.9150545Z triton_flex_attention_345 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.9152852Z triton_flex_attention_343 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9154296Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2820 seconds precompiling for 6 choices 2025-12-04T09:14:57.9154685Z Autotune Choices Stats: 2025-12-04T09:14:57.9156253Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_349", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.9158300Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9159137Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9160133Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9162092Z triton_flex_attention_backward_349 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9164488Z triton_flex_attention_backward_348 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9166881Z triton_flex_attention_backward_350 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9169361Z triton_flex_attention_backward_351 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9171795Z triton_flex_attention_backward_353 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9174229Z triton_flex_attention_backward_355 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9176621Z triton_flex_attention_backward_352 0.0143 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9179065Z triton_flex_attention_backward_354 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9181510Z triton_flex_attention_backward_357 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9183896Z triton_flex_attention_backward_360 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.9185372Z SingleProcess AUTOTUNE benchmarking takes 0.6637 seconds and 2.2959 seconds precompiling for 13 choices 2025-12-04T09:14:57.9185878Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.9186162Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.9186352Z unimplemented [] 2025-12-04T09:14:57.9186558Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.9186924Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.9188369Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.9189666Z graph_break [] 2025-12-04T09:14:57.9189905Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.9190186Z Autotune Choices Stats: 2025-12-04T09:14:57.9191726Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.9193428Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9193962Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9194577Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9196144Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9198486Z triton_flex_attention_366 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9200796Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.9203096Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9205432Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.9207771Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9209216Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3270 seconds precompiling for 6 choices 2025-12-04T09:14:57.9209598Z Autotune Choices Stats: 2025-12-04T09:14:57.9211133Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_368", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.9213130Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9214036Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9214989Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9216808Z triton_flex_attention_backward_368 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9219191Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9221605Z triton_flex_attention_backward_369 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9223980Z triton_flex_attention_backward_370 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9226394Z triton_flex_attention_backward_372 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9228772Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9231187Z triton_flex_attention_backward_374 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9233614Z triton_flex_attention_backward_371 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9236004Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9238388Z triton_flex_attention_backward_379 0.0146 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.9239945Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.4637 seconds precompiling for 13 choices 2025-12-04T09:14:57.9240395Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.9240681Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.9240880Z unimplemented [] 2025-12-04T09:14:57.9241080Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.9241443Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.9242921Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.9244235Z graph_break [] 2025-12-04T09:14:57.9244463Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.9244748Z Autotune Choices Stats: 2025-12-04T09:14:57.9246256Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_384", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.9247952Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9248530Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9249174Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9250701Z triton_flex_attention_384 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9253067Z triton_flex_attention_385 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9255377Z triton_flex_attention_380 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9257721Z triton_flex_attention_382 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.9260055Z triton_flex_attention_383 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.9262465Z triton_flex_attention_381 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9263893Z SingleProcess AUTOTUNE benchmarking takes 0.2918 seconds and 1.3242 seconds precompiling for 6 choices 2025-12-04T09:14:57.9264281Z Autotune Choices Stats: 2025-12-04T09:14:57.9265879Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_386", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:57.9267860Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9268692Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9269663Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9271482Z triton_flex_attention_backward_386 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9273873Z triton_flex_attention_backward_387 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9276312Z triton_flex_attention_backward_388 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9278745Z triton_flex_attention_backward_389 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9281129Z triton_flex_attention_backward_391 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9283547Z triton_flex_attention_backward_392 0.0134 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9285973Z triton_flex_attention_backward_390 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9288358Z triton_flex_attention_backward_393 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9290738Z triton_flex_attention_backward_395 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9293209Z triton_flex_attention_backward_394 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9294691Z SingleProcess AUTOTUNE benchmarking takes 0.6645 seconds and 2.2524 seconds precompiling for 13 choices 2025-12-04T09:14:57.9295190Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:14:57.9295516Z Traceback (most recent call last): 2025-12-04T09:14:57.9295963Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:14:57.9296404Z self.assertTrue( 2025-12-04T09:14:57.9296748Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:14:57.9297103Z raise self.failureException(msg) 2025-12-04T09:14:57.9297518Z AssertionError: False is not true : Log file /tmp/tmpw59c6t4n/flex_attention_configs.json was not created 2025-12-04T09:14:57.9297835Z 2025-12-04T09:14:57.9297977Z To execute this test, run the following from the base repo dir: 2025-12-04T09:14:57.9298435Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:14:57.9298769Z 2025-12-04T09:14:57.9298931Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:14:57.9299308Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.9299587Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.9299776Z unimplemented [] 2025-12-04T09:14:57.9299981Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.9301400Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('async_compile_cache_miss', 23), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:14:57.9302869Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.9303188Z graph_break [] 2025-12-04T09:14:57.9303413Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.9304617Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:14:57.9305745Z current_size = base.storage().size() 2025-12-04T09:14:57.9305959Z Autotune Choices Stats: 2025-12-04T09:14:57.9307474Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:57.9309206Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9309763Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9310386Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9311907Z triton_flex_attention_4 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9314241Z triton_flex_attention_5 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9315354Z triton_flex_attention_0 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9316502Z triton_flex_attention_2 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.9317642Z triton_flex_attention_3 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.9318756Z triton_flex_attention_1 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9319013Z SingleProcess AUTOTUNE benchmarking takes 0.2600 seconds and 1.4244 seconds precompiling for 6 choices 2025-12-04T09:14:57.9319080Z Autotune Choices Stats: 2025-12-04T09:14:57.9320501Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:57.9320996Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9321323Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9321924Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9323096Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9324262Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9325445Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9326616Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9327775Z triton_flex_attention_backward_11 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9328932Z triton_flex_attention_backward_12 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9330111Z triton_flex_attention_backward_13 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9331290Z triton_flex_attention_backward_10 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9332497Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9333701Z triton_flex_attention_backward_15 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9333983Z SingleProcess AUTOTUNE benchmarking takes 0.8717 seconds and 2.2315 seconds precompiling for 13 choices 2025-12-04T09:14:57.9334120Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.9334192Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.9334254Z unimplemented [] 2025-12-04T09:14:57.9334362Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.9334547Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.9335732Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.9335793Z graph_break [] 2025-12-04T09:14:57.9335920Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.9335992Z Autotune Choices Stats: 2025-12-04T09:14:57.9337381Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.9337669Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9337886Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9338214Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9339411Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9340527Z triton_flex_attention_24 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9341637Z triton_flex_attention_19 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9342814Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.9343925Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.9345040Z triton_flex_attention_20 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9345314Z SingleProcess AUTOTUNE benchmarking takes 0.2898 seconds and 1.2778 seconds precompiling for 6 choices 2025-12-04T09:14:57.9345388Z Autotune Choices Stats: 2025-12-04T09:14:57.9346811Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.9347249Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9347605Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9348171Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9349344Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9350536Z triton_flex_attention_backward_27 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9351729Z triton_flex_attention_backward_28 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9352881Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9354035Z triton_flex_attention_backward_30 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9355217Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9356396Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9357544Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9358704Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9360203Z triton_flex_attention_backward_37 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.9360449Z SingleProcess AUTOTUNE benchmarking takes 0.6605 seconds and 2.3199 seconds precompiling for 13 choices 2025-12-04T09:14:57.9365472Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.9365601Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.9365687Z unimplemented [] 2025-12-04T09:14:57.9365819Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.9366020Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.9367212Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.9367398Z graph_break [] 2025-12-04T09:14:57.9367542Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.9367622Z Autotune Choices Stats: 2025-12-04T09:14:57.9369068Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.9369333Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9369563Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9369943Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9371103Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9372231Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9373472Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.9374648Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.9375774Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9376896Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9377208Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2616 seconds precompiling for 6 choices 2025-12-04T09:14:57.9377290Z Autotune Choices Stats: 2025-12-04T09:14:57.9378736Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.9379212Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9379546Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9380123Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9381321Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9382507Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9383660Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9384808Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9385995Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9387136Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9388310Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9389464Z triton_flex_attention_backward_50 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9390643Z triton_flex_attention_backward_53 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9391830Z triton_flex_attention_backward_56 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.9392083Z SingleProcess AUTOTUNE benchmarking takes 0.6631 seconds and 2.2635 seconds precompiling for 13 choices 2025-12-04T09:14:57.9392224Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.9392303Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.9392368Z unimplemented [] 2025-12-04T09:14:57.9392485Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.9392675Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.9393849Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.9393952Z graph_break [] 2025-12-04T09:14:57.9394093Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.9394169Z Autotune Choices Stats: 2025-12-04T09:14:57.9395562Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.9395843Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9396064Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9396383Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9397513Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9398663Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9399812Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.9400927Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.9402030Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9403175Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9403422Z SingleProcess AUTOTUNE benchmarking takes 0.2907 seconds and 1.2500 seconds precompiling for 6 choices 2025-12-04T09:14:57.9403498Z Autotune Choices Stats: 2025-12-04T09:14:57.9404948Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.9405391Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9405717Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9406286Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9407482Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9408712Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9409864Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9411016Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9412200Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9413425Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9414576Z triton_flex_attention_backward_69 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9415751Z triton_flex_attention_backward_70 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9416983Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9418132Z triton_flex_attention_backward_75 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.9418377Z SingleProcess AUTOTUNE benchmarking takes 0.6601 seconds and 2.1469 seconds precompiling for 13 choices 2025-12-04T09:14:57.9418509Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.9418600Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.9418699Z unimplemented [] 2025-12-04T09:14:57.9418806Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.9418998Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.9420173Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.9420243Z graph_break [] 2025-12-04T09:14:57.9420370Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.9420448Z Autotune Choices Stats: 2025-12-04T09:14:57.9421862Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.9422117Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9422335Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9422651Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9423815Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9424957Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9426107Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.9427227Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.9428369Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9429485Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9429735Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2576 seconds precompiling for 6 choices 2025-12-04T09:14:57.9429842Z Autotune Choices Stats: 2025-12-04T09:14:57.9431268Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_84", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.9431710Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9432036Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9432668Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9433842Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9435006Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9436147Z triton_flex_attention_backward_83 0.0122 ms 92.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9437328Z triton_flex_attention_backward_82 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9438510Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9439665Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9440812Z triton_flex_attention_backward_88 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9442038Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9443195Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9444339Z triton_flex_attention_backward_91 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9444620Z SingleProcess AUTOTUNE benchmarking takes 0.6599 seconds and 2.2799 seconds precompiling for 13 choices 2025-12-04T09:14:57.9444751Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.9444828Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.9444891Z unimplemented [] 2025-12-04T09:14:57.9444992Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.9445184Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.9446353Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.9446422Z graph_break [] 2025-12-04T09:14:57.9446547Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.9446666Z Autotune Choices Stats: 2025-12-04T09:14:57.9448054Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:57.9448307Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9448528Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9448851Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9450038Z triton_flex_attention_99 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9451148Z triton_flex_attention_100 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9452262Z triton_flex_attention_95 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9453433Z triton_flex_attention_98 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.9454581Z triton_flex_attention_97 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.9455724Z triton_flex_attention_96 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9455968Z SingleProcess AUTOTUNE benchmarking takes 0.2905 seconds and 1.2994 seconds precompiling for 6 choices 2025-12-04T09:14:57.9456039Z Autotune Choices Stats: 2025-12-04T09:14:57.9457460Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_102", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.9457929Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9458296Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9458869Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9460048Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9461212Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9462548Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9463717Z triton_flex_attention_backward_101 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9464928Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9466087Z triton_flex_attention_backward_107 0.0133 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9467289Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9468486Z triton_flex_attention_backward_108 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9469641Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9470789Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9471093Z SingleProcess AUTOTUNE benchmarking takes 0.6615 seconds and 2.2434 seconds precompiling for 13 choices 2025-12-04T09:14:57.9471225Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.9471306Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.9471382Z unimplemented [] 2025-12-04T09:14:57.9471489Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.9471682Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.9472912Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.9472986Z graph_break [] 2025-12-04T09:14:57.9473116Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.9473184Z Autotune Choices Stats: 2025-12-04T09:14:57.9474589Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.9474841Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9475132Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9475447Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9476585Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9477703Z triton_flex_attention_119 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9478821Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.9479971Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.9481107Z triton_flex_attention_114 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9482228Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9482473Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2587 seconds precompiling for 6 choices 2025-12-04T09:14:57.9482548Z Autotune Choices Stats: 2025-12-04T09:14:57.9484012Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:57.9484481Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9484805Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9485373Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9486544Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9487754Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9488912Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9490101Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9491245Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9494529Z triton_flex_attention_backward_124 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9495757Z triton_flex_attention_backward_126 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9496952Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9498123Z triton_flex_attention_backward_129 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9499270Z triton_flex_attention_backward_132 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.9499530Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.3330 seconds precompiling for 13 choices 2025-12-04T09:14:57.9499669Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.9499744Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.9499818Z unimplemented [] 2025-12-04T09:14:57.9499928Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.9500157Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.9501340Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.9501406Z graph_break [] 2025-12-04T09:14:57.9501547Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.9501615Z Autotune Choices Stats: 2025-12-04T09:14:57.9503066Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:57.9503441Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9503669Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9503989Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9505126Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9506249Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9507369Z triton_flex_attention_136 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.9508480Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.9509645Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9510759Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9511062Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3488 seconds precompiling for 6 choices 2025-12-04T09:14:57.9511128Z Autotune Choices Stats: 2025-12-04T09:14:57.9512628Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_140", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.9513065Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9513401Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9513962Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9515132Z triton_flex_attention_backward_140 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9516289Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9517475Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9518631Z triton_flex_attention_backward_142 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9519815Z triton_flex_attention_backward_144 0.0131 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9521041Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9522188Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9523349Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9524496Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9525644Z triton_flex_attention_backward_149 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9525931Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.2392 seconds precompiling for 13 choices 2025-12-04T09:14:57.9526064Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.9526137Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.9526221Z unimplemented [] 2025-12-04T09:14:57.9526328Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.9526517Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.9527693Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.9527796Z graph_break [] 2025-12-04T09:14:57.9527930Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.9528000Z Autotune Choices Stats: 2025-12-04T09:14:57.9529459Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.9529703Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9529932Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9530248Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9531378Z triton_flex_attention_156 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9532554Z triton_flex_attention_157 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9533700Z triton_flex_attention_152 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9534851Z triton_flex_attention_154 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.9535970Z triton_flex_attention_155 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.9537111Z triton_flex_attention_153 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9537450Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3309 seconds precompiling for 6 choices 2025-12-04T09:14:57.9537515Z Autotune Choices Stats: 2025-12-04T09:14:57.9538958Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_160", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.9539396Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9539728Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9540287Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9541459Z triton_flex_attention_backward_160 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9542644Z triton_flex_attention_backward_158 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9543802Z triton_flex_attention_backward_159 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9544948Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9546177Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9547377Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9548526Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9549677Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9550826Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9552014Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.9552270Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.1546 seconds precompiling for 13 choices 2025-12-04T09:14:57.9552398Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.9552468Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.9552539Z unimplemented [] 2025-12-04T09:14:57.9552641Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.9552827Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.9554089Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.9554220Z graph_break [] 2025-12-04T09:14:57.9554355Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.9554421Z Autotune Choices Stats: 2025-12-04T09:14:57.9555820Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.9556064Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9556286Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9556607Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9557740Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9558862Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9560015Z triton_flex_attention_171 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9561137Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.9562418Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.9563727Z triton_flex_attention_172 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9563985Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3199 seconds precompiling for 6 choices 2025-12-04T09:14:57.9564056Z Autotune Choices Stats: 2025-12-04T09:14:57.9565502Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_178", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.9565938Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9566273Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9566838Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9568038Z triton_flex_attention_backward_178 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9569261Z triton_flex_attention_backward_177 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9570415Z triton_flex_attention_backward_180 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9571646Z triton_flex_attention_backward_179 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9572874Z triton_flex_attention_backward_182 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9574037Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9575189Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9576343Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9577523Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9578683Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.9578935Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3449 seconds precompiling for 13 choices 2025-12-04T09:14:57.9579066Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.9579172Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.9579242Z unimplemented [] 2025-12-04T09:14:57.9579345Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.9579529Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.9580785Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.9580845Z graph_break [] 2025-12-04T09:14:57.9580984Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.9581060Z Autotune Choices Stats: 2025-12-04T09:14:57.9582462Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:57.9582705Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9582922Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9583246Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9584375Z triton_flex_attention_194 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9585533Z triton_flex_attention_195 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9586662Z triton_flex_attention_192 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.9587775Z triton_flex_attention_193 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.9588967Z triton_flex_attention_190 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9590107Z triton_flex_attention_191 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9590359Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3094 seconds precompiling for 6 choices 2025-12-04T09:14:57.9590426Z Autotune Choices Stats: 2025-12-04T09:14:57.9591859Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.9592290Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9592622Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9593182Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9594405Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9595567Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9596755Z triton_flex_attention_backward_197 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9597990Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9599160Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9600327Z triton_flex_attention_backward_200 0.0124 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9601484Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9602644Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9603842Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.9605013Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9605295Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.3930 seconds precompiling for 13 choices 2025-12-04T09:14:57.9605500Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.9605570Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.9605641Z unimplemented [] 2025-12-04T09:14:57.9605743Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.9605926Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.9607103Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.9607165Z graph_break [] 2025-12-04T09:14:57.9607301Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.9607366Z Autotune Choices Stats: 2025-12-04T09:14:57.9608769Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008063999935984612, "best_triton_pos": 0} 2025-12-04T09:14:57.9609023Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9609246Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9609575Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9610709Z triton_flex_attention_213 0.0081 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9611881Z triton_flex_attention_214 0.0082 ms 98.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9613050Z triton_flex_attention_211 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.9614258Z triton_flex_attention_212 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.9615404Z triton_flex_attention_209 0.0112 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9616519Z triton_flex_attention_210 0.0133 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9616772Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.3148 seconds precompiling for 6 choices 2025-12-04T09:14:57.9616839Z Autotune Choices Stats: 2025-12-04T09:14:57.9618268Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.9618704Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9619038Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9619628Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9620805Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9621958Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9623203Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9624363Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9625521Z triton_flex_attention_backward_219 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9626673Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9627818Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9629009Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9630154Z triton_flex_attention_backward_224 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9631385Z triton_flex_attention_backward_227 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.9631677Z SingleProcess AUTOTUNE benchmarking takes 0.6619 seconds and 2.3746 seconds precompiling for 13 choices 2025-12-04T09:14:57.9631815Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.9631882Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.9631950Z unimplemented [] 2025-12-04T09:14:57.9632049Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.9632230Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.9633411Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.9633472Z graph_break [] 2025-12-04T09:14:57.9633605Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.9633670Z Autotune Choices Stats: 2025-12-04T09:14:57.9635066Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.9635318Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9635533Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9635849Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9637017Z triton_flex_attention_232 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9638154Z triton_flex_attention_233 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9639296Z triton_flex_attention_228 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9640489Z triton_flex_attention_230 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.9641614Z triton_flex_attention_231 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.9642722Z triton_flex_attention_229 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9642971Z SingleProcess AUTOTUNE benchmarking takes 0.2900 seconds and 1.2906 seconds precompiling for 6 choices 2025-12-04T09:14:57.9643035Z Autotune Choices Stats: 2025-12-04T09:14:57.9644463Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:57.9644891Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9645257Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9645816Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9646987Z triton_flex_attention_backward_234 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9648213Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9649402Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9650555Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9651709Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9652915Z triton_flex_attention_backward_240 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9654101Z triton_flex_attention_backward_241 0.0134 ms 91.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9655256Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9656431Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9657657Z triton_flex_attention_backward_246 0.0153 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.9657902Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2768 seconds precompiling for 13 choices 2025-12-04T09:14:57.9658039Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.9658108Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.9658173Z unimplemented [] 2025-12-04T09:14:57.9658275Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.9658456Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.9659644Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.9659706Z graph_break [] 2025-12-04T09:14:57.9659850Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.9659921Z Autotune Choices Stats: 2025-12-04T09:14:57.9661317Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.9661565Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9661967Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9662299Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9663439Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9664572Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9665843Z triton_flex_attention_250 0.0111 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.9666954Z triton_flex_attention_247 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9668077Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.9669203Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9669455Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2952 seconds precompiling for 6 choices 2025-12-04T09:14:57.9669521Z Autotune Choices Stats: 2025-12-04T09:14:57.9670996Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_255", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.9671430Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9671766Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9672326Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9673551Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9674765Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9675923Z triton_flex_attention_backward_254 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9677080Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9678239Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9679431Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9680581Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9681731Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9682938Z triton_flex_attention_backward_262 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9684132Z triton_flex_attention_backward_265 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.9684383Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.3139 seconds precompiling for 13 choices 2025-12-04T09:14:57.9684519Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.9684588Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.9684650Z unimplemented [] 2025-12-04T09:14:57.9684756Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.9684936Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.9686118Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.9686180Z graph_break [] 2025-12-04T09:14:57.9686311Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.9686375Z Autotune Choices Stats: 2025-12-04T09:14:57.9687853Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.9688119Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9688343Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9688664Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9689793Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9690976Z triton_flex_attention_271 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9692139Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9693322Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.9694441Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.9695552Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9695800Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T09:14:57.9695864Z Autotune Choices Stats: 2025-12-04T09:14:57.9697326Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:57.9697761Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9698089Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9698705Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9699931Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9701083Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9702254Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9703405Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9704559Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9705736Z triton_flex_attention_backward_278 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9706893Z triton_flex_attention_backward_279 0.0136 ms 90.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9708106Z triton_flex_attention_backward_276 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9709291Z triton_flex_attention_backward_281 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9710447Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.9710692Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.2940 seconds precompiling for 13 choices 2025-12-04T09:14:57.9710824Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.9710894Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.9710957Z unimplemented [] 2025-12-04T09:14:57.9711067Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.9711249Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.9712437Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.9712496Z graph_break [] 2025-12-04T09:14:57.9712629Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.9712695Z Autotune Choices Stats: 2025-12-04T09:14:57.9714118Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.9714369Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9714589Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9714907Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9716099Z triton_flex_attention_289 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9717243Z triton_flex_attention_290 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9718359Z triton_flex_attention_287 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.9719492Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9720615Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.9721763Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9722012Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3003 seconds precompiling for 6 choices 2025-12-04T09:14:57.9722077Z Autotune Choices Stats: 2025-12-04T09:14:57.9723497Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_294", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.012160000391304493, "best_triton_pos": 0} 2025-12-04T09:14:57.9723927Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9724320Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9724910Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9726074Z triton_flex_attention_backward_294 0.0122 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9727230Z triton_flex_attention_backward_291 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9728383Z triton_flex_attention_backward_292 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9729541Z triton_flex_attention_backward_293 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9730741Z triton_flex_attention_backward_296 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9731892Z triton_flex_attention_backward_295 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9733115Z triton_flex_attention_backward_297 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9734348Z triton_flex_attention_backward_298 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9735502Z triton_flex_attention_backward_300 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9736654Z triton_flex_attention_backward_303 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.9736903Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.2523 seconds precompiling for 13 choices 2025-12-04T09:14:57.9737036Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.9737109Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.9737170Z unimplemented [] 2025-12-04T09:14:57.9737277Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.9737462Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.9738695Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.9738758Z graph_break [] 2025-12-04T09:14:57.9738885Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.9738960Z Autotune Choices Stats: 2025-12-04T09:14:57.9740347Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_308", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.9740601Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9740853Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9741241Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9742375Z triton_flex_attention_308 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9743488Z triton_flex_attention_309 0.0082 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9744605Z triton_flex_attention_307 0.0112 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.9745727Z triton_flex_attention_304 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9746848Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.9748005Z triton_flex_attention_305 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9748256Z SingleProcess AUTOTUNE benchmarking takes 0.2924 seconds and 1.4096 seconds precompiling for 6 choices 2025-12-04T09:14:57.9748322Z Autotune Choices Stats: 2025-12-04T09:14:57.9749778Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.9750276Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9750599Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9751162Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9752334Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9753496Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9754652Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9755837Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9756998Z triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9758153Z triton_flex_attention_backward_316 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9759429Z triton_flex_attention_backward_317 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9760584Z triton_flex_attention_backward_314 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9761741Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9763248Z triton_flex_attention_backward_318 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9763497Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.3225 seconds precompiling for 13 choices 2025-12-04T09:14:57.9763632Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.9763701Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.9763761Z unimplemented [] 2025-12-04T09:14:57.9763867Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.9764048Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.9765307Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.9765373Z graph_break [] 2025-12-04T09:14:57.9765501Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.9765571Z Autotune Choices Stats: 2025-12-04T09:14:57.9766970Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.007968000136315823, "best_triton_pos": 0} 2025-12-04T09:14:57.9767361Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9767581Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9767902Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9769031Z triton_flex_attention_327 0.0080 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9770155Z triton_flex_attention_328 0.0082 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9771268Z triton_flex_attention_325 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.9772427Z triton_flex_attention_326 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.9773592Z triton_flex_attention_323 0.0113 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9774715Z triton_flex_attention_324 0.0133 ms 59.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9774960Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T09:14:57.9775067Z Autotune Choices Stats: 2025-12-04T09:14:57.9776520Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.9776987Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9777312Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9777877Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9779055Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9780217Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9781371Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9782554Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9783710Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9784919Z triton_flex_attention_backward_336 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9786106Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9787256Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9788406Z triton_flex_attention_backward_338 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9789558Z triton_flex_attention_backward_341 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.9789801Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3668 seconds precompiling for 13 choices 2025-12-04T09:14:57.9789938Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.9790042Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.9790104Z unimplemented [] 2025-12-04T09:14:57.9790219Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.9790405Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.9791589Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.9791647Z graph_break [] 2025-12-04T09:14:57.9791775Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.9791878Z Autotune Choices Stats: 2025-12-04T09:14:57.9793315Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.9793595Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9793811Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9794131Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9795261Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9796377Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9797504Z triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9798658Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.9799775Z triton_flex_attention_345 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.9800896Z triton_flex_attention_343 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9801237Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2820 seconds precompiling for 6 choices 2025-12-04T09:14:57.9801303Z Autotune Choices Stats: 2025-12-04T09:14:57.9802720Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_349", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.9803157Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9803483Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9804044Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9805207Z triton_flex_attention_backward_349 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9806366Z triton_flex_attention_backward_348 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9807550Z triton_flex_attention_backward_350 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9808699Z triton_flex_attention_backward_351 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9810064Z triton_flex_attention_backward_353 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9811243Z triton_flex_attention_backward_355 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9812433Z triton_flex_attention_backward_352 0.0143 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9813595Z triton_flex_attention_backward_354 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9814759Z triton_flex_attention_backward_357 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9815948Z triton_flex_attention_backward_360 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.9816196Z SingleProcess AUTOTUNE benchmarking takes 0.6637 seconds and 2.2959 seconds precompiling for 13 choices 2025-12-04T09:14:57.9816330Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.9816397Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.9816461Z unimplemented [] 2025-12-04T09:14:57.9816568Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.9816750Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.9817929Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.9818070Z graph_break [] 2025-12-04T09:14:57.9818278Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.9818350Z Autotune Choices Stats: 2025-12-04T09:14:57.9819740Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.9819991Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9820208Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9820530Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9821660Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9822780Z triton_flex_attention_366 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9823929Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.9825044Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9826157Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.9827359Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9827635Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3270 seconds precompiling for 6 choices 2025-12-04T09:14:57.9827702Z Autotune Choices Stats: 2025-12-04T09:14:57.9829129Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_368", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.9829567Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9829890Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9830457Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9831624Z triton_flex_attention_backward_368 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9832809Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9833970Z triton_flex_attention_backward_369 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9835162Z triton_flex_attention_backward_370 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9836377Z triton_flex_attention_backward_372 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9837521Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9838675Z triton_flex_attention_backward_374 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9839827Z triton_flex_attention_backward_371 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9840972Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9842153Z triton_flex_attention_backward_379 0.0146 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.9842395Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.4637 seconds precompiling for 13 choices 2025-12-04T09:14:57.9842527Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.9842595Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.9842655Z unimplemented [] 2025-12-04T09:14:57.9842760Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.9842979Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.9844192Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.9844281Z graph_break [] 2025-12-04T09:14:57.9844407Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.9844477Z Autotune Choices Stats: 2025-12-04T09:14:57.9845868Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_384", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.9846118Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9846336Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9846655Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9847787Z triton_flex_attention_384 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9848915Z triton_flex_attention_385 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9850059Z triton_flex_attention_380 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9851181Z triton_flex_attention_382 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.9852407Z triton_flex_attention_383 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.9853588Z triton_flex_attention_381 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9853835Z SingleProcess AUTOTUNE benchmarking takes 0.2918 seconds and 1.3242 seconds precompiling for 6 choices 2025-12-04T09:14:57.9853905Z Autotune Choices Stats: 2025-12-04T09:14:57.9855337Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_386", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:57.9855778Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9856107Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9856673Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9857882Z triton_flex_attention_backward_386 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9859048Z triton_flex_attention_backward_387 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9860200Z triton_flex_attention_backward_388 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9861403Z triton_flex_attention_backward_389 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9862698Z triton_flex_attention_backward_391 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9863853Z triton_flex_attention_backward_392 0.0134 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9865011Z triton_flex_attention_backward_390 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9866162Z triton_flex_attention_backward_393 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9867369Z triton_flex_attention_backward_395 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9868531Z triton_flex_attention_backward_394 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9868775Z SingleProcess AUTOTUNE benchmarking takes 0.6645 seconds and 2.2524 seconds precompiling for 13 choices 2025-12-04T09:14:57.9868959Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.9869026Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.9869133Z unimplemented [] 2025-12-04T09:14:57.9869285Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.9869469Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.9870648Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.9870710Z graph_break [] 2025-12-04T09:14:57.9870837Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.9870906Z Autotune Choices Stats: 2025-12-04T09:14:57.9872297Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_403", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.9872548Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9872767Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9873090Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9874225Z triton_flex_attention_403 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9875378Z triton_flex_attention_404 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9876490Z triton_flex_attention_399 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9877657Z triton_flex_attention_401 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.9878835Z triton_flex_attention_402 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.9879960Z triton_flex_attention_400 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9880217Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3374 seconds precompiling for 6 choices 2025-12-04T09:14:57.9880290Z Autotune Choices Stats: 2025-12-04T09:14:57.9881717Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_406", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.9882157Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9882478Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9883044Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9884254Z triton_flex_attention_backward_406 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9885411Z triton_flex_attention_backward_405 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9886632Z triton_flex_attention_backward_407 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9887824Z triton_flex_attention_backward_408 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9888976Z triton_flex_attention_backward_410 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9890121Z triton_flex_attention_backward_409 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9891288Z triton_flex_attention_backward_411 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9892510Z triton_flex_attention_backward_412 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9893667Z triton_flex_attention_backward_413 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9894815Z triton_flex_attention_backward_414 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9895156Z SingleProcess AUTOTUNE benchmarking takes 0.6644 seconds and 2.4031 seconds precompiling for 13 choices 2025-12-04T09:14:57.9895334Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:14:57.9895411Z Traceback (most recent call last): 2025-12-04T09:14:57.9895707Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:14:57.9895775Z self.assertTrue( 2025-12-04T09:14:57.9895975Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:14:57.9896056Z raise self.failureException(msg) 2025-12-04T09:14:57.9896301Z AssertionError: False is not true : Log file /tmp/tmp92s2b8yb/flex_attention_configs.json was not created 2025-12-04T09:14:57.9896307Z 2025-12-04T09:14:57.9896450Z To execute this test, run the following from the base repo dir: 2025-12-04T09:14:57.9896711Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:14:57.9896716Z 2025-12-04T09:14:57.9896876Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:14:57.9897005Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.9897078Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.9897141Z unimplemented [] 2025-12-04T09:14:57.9897244Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.9898442Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('async_compile_cache_miss', 23), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:14:57.9898629Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.9898701Z graph_break [] 2025-12-04T09:14:57.9898827Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.9899817Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:14:57.9899901Z current_size = base.storage().size() 2025-12-04T09:14:57.9900009Z Autotune Choices Stats: 2025-12-04T09:14:57.9901407Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:57.9901653Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9901876Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9902228Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9903393Z triton_flex_attention_4 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9904532Z triton_flex_attention_5 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9905645Z triton_flex_attention_0 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9906759Z triton_flex_attention_2 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.9907868Z triton_flex_attention_3 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.9909015Z triton_flex_attention_1 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9909260Z SingleProcess AUTOTUNE benchmarking takes 0.2600 seconds and 1.4244 seconds precompiling for 6 choices 2025-12-04T09:14:57.9909331Z Autotune Choices Stats: 2025-12-04T09:14:57.9910753Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:57.9911229Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9911618Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9912174Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9913352Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9914509Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9915650Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9916800Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9917982Z triton_flex_attention_backward_11 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9919129Z triton_flex_attention_backward_12 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9920365Z triton_flex_attention_backward_13 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9921540Z triton_flex_attention_backward_10 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9922689Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9923830Z triton_flex_attention_backward_15 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9924083Z SingleProcess AUTOTUNE benchmarking takes 0.8717 seconds and 2.2315 seconds precompiling for 13 choices 2025-12-04T09:14:57.9924215Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.9924292Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.9924354Z unimplemented [] 2025-12-04T09:14:57.9924457Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.9924660Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.9925868Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.9925936Z graph_break [] 2025-12-04T09:14:57.9926064Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.9926127Z Autotune Choices Stats: 2025-12-04T09:14:57.9927522Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.9927794Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9928087Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9928401Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9929536Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9930644Z triton_flex_attention_24 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9931756Z triton_flex_attention_19 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9932959Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.9934115Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.9935231Z triton_flex_attention_20 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9935473Z SingleProcess AUTOTUNE benchmarking takes 0.2898 seconds and 1.2778 seconds precompiling for 6 choices 2025-12-04T09:14:57.9935537Z Autotune Choices Stats: 2025-12-04T09:14:57.9936989Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.9937490Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9937816Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9938374Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9939543Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9940693Z triton_flex_attention_backward_27 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9941844Z triton_flex_attention_backward_28 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9943022Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9944174Z triton_flex_attention_backward_30 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9945343Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9946550Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9947691Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9948848Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9949997Z triton_flex_attention_backward_37 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.9950247Z SingleProcess AUTOTUNE benchmarking takes 0.6605 seconds and 2.3199 seconds precompiling for 13 choices 2025-12-04T09:14:57.9950374Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.9950449Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.9950508Z unimplemented [] 2025-12-04T09:14:57.9950608Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.9950831Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.9952006Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.9952069Z graph_break [] 2025-12-04T09:14:57.9952193Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.9952258Z Autotune Choices Stats: 2025-12-04T09:14:57.9953688Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.9954036Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9954259Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9954573Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9955702Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9956815Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9957933Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.9959041Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.9960184Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9961301Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9961574Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2616 seconds precompiling for 6 choices 2025-12-04T09:14:57.9961639Z Autotune Choices Stats: 2025-12-04T09:14:57.9963230Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.9963726Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9964060Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9964620Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9965784Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9966943Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9968152Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9969301Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9970441Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9971716Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9973030Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9974182Z triton_flex_attention_backward_50 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9975334Z triton_flex_attention_backward_53 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9976477Z triton_flex_attention_backward_56 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:57.9976726Z SingleProcess AUTOTUNE benchmarking takes 0.6631 seconds and 2.2635 seconds precompiling for 13 choices 2025-12-04T09:14:57.9976892Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:57.9976967Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:57.9977030Z unimplemented [] 2025-12-04T09:14:57.9977171Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:57.9977363Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:57.9978553Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:57.9978659Z graph_break [] 2025-12-04T09:14:57.9978800Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:57.9978874Z Autotune Choices Stats: 2025-12-04T09:14:57.9980283Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:57.9980561Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9980779Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9981133Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9982268Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9983374Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9984496Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:57.9985668Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:57.9986784Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9987935Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9988240Z SingleProcess AUTOTUNE benchmarking takes 0.2907 seconds and 1.2500 seconds precompiling for 6 choices 2025-12-04T09:14:57.9988312Z Autotune Choices Stats: 2025-12-04T09:14:57.9989731Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:57.9990171Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:57.9990495Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:57.9991069Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:57.9992236Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9993426Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9994647Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:57.9995808Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:57.9997027Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:57.9998232Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:57.9999424Z triton_flex_attention_backward_69 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0000631Z triton_flex_attention_backward_70 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0001783Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0002969Z triton_flex_attention_backward_75 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.0003220Z SingleProcess AUTOTUNE benchmarking takes 0.6601 seconds and 2.1469 seconds precompiling for 13 choices 2025-12-04T09:14:58.0003349Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.0003436Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.0003501Z unimplemented [] 2025-12-04T09:14:58.0003607Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.0003794Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.0005033Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.0005250Z graph_break [] 2025-12-04T09:14:58.0005376Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.0005477Z Autotune Choices Stats: 2025-12-04T09:14:58.0006869Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.0007116Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0007333Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0007677Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0008806Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0009921Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0011079Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.0012197Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.0013362Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0014543Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0014836Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2576 seconds precompiling for 6 choices 2025-12-04T09:14:58.0014908Z Autotune Choices Stats: 2025-12-04T09:14:58.0016332Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_84", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.0016771Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0017096Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0017660Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0018828Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0020016Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0021202Z triton_flex_attention_backward_83 0.0122 ms 92.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0022381Z triton_flex_attention_backward_82 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0023598Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0024748Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0025897Z triton_flex_attention_backward_88 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0027042Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0028219Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0029363Z triton_flex_attention_backward_91 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0029613Z SingleProcess AUTOTUNE benchmarking takes 0.6599 seconds and 2.2799 seconds precompiling for 13 choices 2025-12-04T09:14:58.0029741Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.0029817Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.0029878Z unimplemented [] 2025-12-04T09:14:58.0030015Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.0030204Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.0031410Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.0031509Z graph_break [] 2025-12-04T09:14:58.0031635Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.0031705Z Autotune Choices Stats: 2025-12-04T09:14:58.0033086Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:58.0033334Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0033552Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0033869Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0035017Z triton_flex_attention_99 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0036140Z triton_flex_attention_100 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0037297Z triton_flex_attention_95 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0038418Z triton_flex_attention_98 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.0039601Z triton_flex_attention_97 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.0040749Z triton_flex_attention_96 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0040992Z SingleProcess AUTOTUNE benchmarking takes 0.2905 seconds and 1.2994 seconds precompiling for 6 choices 2025-12-04T09:14:58.0041064Z Autotune Choices Stats: 2025-12-04T09:14:58.0042495Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_102", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.0042930Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0043253Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0043820Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0045014Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0046192Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0047345Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0048644Z triton_flex_attention_backward_101 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0049796Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0050946Z triton_flex_attention_backward_107 0.0133 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0052098Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0053316Z triton_flex_attention_backward_108 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0054506Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0055657Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0055940Z SingleProcess AUTOTUNE benchmarking takes 0.6615 seconds and 2.2434 seconds precompiling for 13 choices 2025-12-04T09:14:58.0056071Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.0056181Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.0056272Z unimplemented [] 2025-12-04T09:14:58.0056376Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.0056563Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.0057733Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.0057803Z graph_break [] 2025-12-04T09:14:58.0057928Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.0057994Z Autotune Choices Stats: 2025-12-04T09:14:58.0059389Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.0059627Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0059851Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0060169Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0061304Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0062610Z triton_flex_attention_119 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0063763Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.0064944Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.0066162Z triton_flex_attention_114 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0067283Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0067532Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2587 seconds precompiling for 6 choices 2025-12-04T09:14:58.0067605Z Autotune Choices Stats: 2025-12-04T09:14:58.0069046Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.0069490Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0069815Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0070382Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0071592Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0072756Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0073968Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0075154Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0076307Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0077462Z triton_flex_attention_backward_124 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0078631Z triton_flex_attention_backward_126 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0079820Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0080974Z triton_flex_attention_backward_129 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0082176Z triton_flex_attention_backward_132 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.0082493Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.3330 seconds precompiling for 13 choices 2025-12-04T09:14:58.0082623Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.0082697Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.0082759Z unimplemented [] 2025-12-04T09:14:58.0082861Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.0083048Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.0084225Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.0084292Z graph_break [] 2025-12-04T09:14:58.0084417Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.0084483Z Autotune Choices Stats: 2025-12-04T09:14:58.0085872Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:58.0086117Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0086343Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0086661Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0087829Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0088946Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0090060Z triton_flex_attention_136 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.0091309Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.0092456Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0093577Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0093818Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3488 seconds precompiling for 6 choices 2025-12-04T09:14:58.0093891Z Autotune Choices Stats: 2025-12-04T09:14:58.0095314Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_140", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.0095752Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0096113Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0096683Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0097849Z triton_flex_attention_backward_140 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0099078Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0100258Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0101414Z triton_flex_attention_backward_142 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0102562Z triton_flex_attention_backward_144 0.0131 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0103713Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0104902Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0106046Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0107209Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0108450Z triton_flex_attention_backward_149 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0108701Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.2392 seconds precompiling for 13 choices 2025-12-04T09:14:58.0108838Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.0108918Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.0108982Z unimplemented [] 2025-12-04T09:14:58.0109083Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.0109274Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.0110450Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.0110516Z graph_break [] 2025-12-04T09:14:58.0110643Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.0110710Z Autotune Choices Stats: 2025-12-04T09:14:58.0112101Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.0112344Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0112562Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0112930Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0114062Z triton_flex_attention_156 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0115172Z triton_flex_attention_157 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0116362Z triton_flex_attention_152 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0117513Z triton_flex_attention_154 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.0118626Z triton_flex_attention_155 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.0119752Z triton_flex_attention_153 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0119996Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3309 seconds precompiling for 6 choices 2025-12-04T09:14:58.0120068Z Autotune Choices Stats: 2025-12-04T09:14:58.0121487Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_160", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.0121958Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0122286Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0122845Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0124050Z triton_flex_attention_backward_160 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0125266Z triton_flex_attention_backward_158 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0126426Z triton_flex_attention_backward_159 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0127586Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0128732Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0129892Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0131075Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0132221Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0133531Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0134708Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.0134956Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.1546 seconds precompiling for 13 choices 2025-12-04T09:14:58.0135092Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.0135165Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.0135227Z unimplemented [] 2025-12-04T09:14:58.0135330Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.0135520Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.0136694Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.0136766Z graph_break [] 2025-12-04T09:14:58.0136895Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.0136961Z Autotune Choices Stats: 2025-12-04T09:14:58.0138351Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.0138631Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0138867Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0139185Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0140322Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0141500Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0142662Z triton_flex_attention_171 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0143783Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.0144896Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.0146019Z triton_flex_attention_172 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0146263Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3199 seconds precompiling for 6 choices 2025-12-04T09:14:58.0146333Z Autotune Choices Stats: 2025-12-04T09:14:58.0147789Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_178", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.0148228Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0148551Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0149123Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0150351Z triton_flex_attention_backward_178 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0151537Z triton_flex_attention_backward_177 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0152686Z triton_flex_attention_backward_180 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0153836Z triton_flex_attention_backward_179 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0154980Z triton_flex_attention_backward_182 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0156166Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0157325Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0158502Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0159720Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0160867Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.0161125Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3449 seconds precompiling for 13 choices 2025-12-04T09:14:58.0161254Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.0161325Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.0161385Z unimplemented [] 2025-12-04T09:14:58.0161483Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.0161671Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.0163039Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.0163108Z graph_break [] 2025-12-04T09:14:58.0163233Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.0163298Z Autotune Choices Stats: 2025-12-04T09:14:58.0164754Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:58.0165001Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0165221Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0165532Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0166707Z triton_flex_attention_194 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0167916Z triton_flex_attention_195 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0169060Z triton_flex_attention_192 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.0170183Z triton_flex_attention_193 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.0171294Z triton_flex_attention_190 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0172465Z triton_flex_attention_191 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0172748Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3094 seconds precompiling for 6 choices 2025-12-04T09:14:58.0172823Z Autotune Choices Stats: 2025-12-04T09:14:58.0174247Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.0174686Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0175045Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0175689Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0176856Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0178012Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0179165Z triton_flex_attention_backward_197 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0180319Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0181495Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0182651Z triton_flex_attention_backward_200 0.0124 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0183802Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0185063Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0186217Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.0187368Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0187623Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.3930 seconds precompiling for 13 choices 2025-12-04T09:14:58.0187755Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.0187829Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.0187891Z unimplemented [] 2025-12-04T09:14:58.0187997Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.0188185Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.0189366Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.0189434Z graph_break [] 2025-12-04T09:14:58.0189595Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.0189663Z Autotune Choices Stats: 2025-12-04T09:14:58.0191096Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008063999935984612, "best_triton_pos": 0} 2025-12-04T09:14:58.0191341Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0191568Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0191919Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0193113Z triton_flex_attention_213 0.0081 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0194221Z triton_flex_attention_214 0.0082 ms 98.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0195338Z triton_flex_attention_211 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.0196453Z triton_flex_attention_212 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.0197561Z triton_flex_attention_209 0.0112 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0198706Z triton_flex_attention_210 0.0133 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0198960Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.3148 seconds precompiling for 6 choices 2025-12-04T09:14:58.0199030Z Autotune Choices Stats: 2025-12-04T09:14:58.0200451Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.0200950Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0201302Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0201861Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0203020Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0204174Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0205328Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0206506Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0207660Z triton_flex_attention_backward_219 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0208812Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0210044Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0211218Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0212428Z triton_flex_attention_backward_224 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0213592Z triton_flex_attention_backward_227 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.0213845Z SingleProcess AUTOTUNE benchmarking takes 0.6619 seconds and 2.3746 seconds precompiling for 13 choices 2025-12-04T09:14:58.0213973Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.0214047Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.0214107Z unimplemented [] 2025-12-04T09:14:58.0214207Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.0214400Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.0215612Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.0215681Z graph_break [] 2025-12-04T09:14:58.0215809Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.0215874Z Autotune Choices Stats: 2025-12-04T09:14:58.0217259Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.0217540Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0217820Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0218139Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0219267Z triton_flex_attention_232 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0220376Z triton_flex_attention_233 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0221504Z triton_flex_attention_228 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0222635Z triton_flex_attention_230 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.0223824Z triton_flex_attention_231 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.0224949Z triton_flex_attention_229 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0225189Z SingleProcess AUTOTUNE benchmarking takes 0.2900 seconds and 1.2906 seconds precompiling for 6 choices 2025-12-04T09:14:58.0225263Z Autotune Choices Stats: 2025-12-04T09:14:58.0226726Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.0227222Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0227548Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0228112Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0229272Z triton_flex_attention_backward_234 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0230437Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0231587Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0232771Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0233926Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0235132Z triton_flex_attention_backward_240 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0236312Z triton_flex_attention_backward_241 0.0134 ms 91.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0237461Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0238616Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0239775Z triton_flex_attention_backward_246 0.0153 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.0240019Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2768 seconds precompiling for 13 choices 2025-12-04T09:14:58.0240147Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.0240220Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.0240282Z unimplemented [] 2025-12-04T09:14:58.0240445Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.0240639Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.0241818Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.0241882Z graph_break [] 2025-12-04T09:14:58.0242007Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.0242079Z Autotune Choices Stats: 2025-12-04T09:14:58.0243496Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.0243804Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0244019Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0244334Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0245466Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0246575Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0247697Z triton_flex_attention_250 0.0111 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.0248855Z triton_flex_attention_247 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0249967Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.0251082Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0251682Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2952 seconds precompiling for 6 choices 2025-12-04T09:14:58.0251789Z Autotune Choices Stats: 2025-12-04T09:14:58.0253311Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_255", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.0253754Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0254081Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0254648Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0255812Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0256971Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0258155Z triton_flex_attention_backward_254 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0259314Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0260504Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0261710Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0263031Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0264190Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0265346Z triton_flex_attention_backward_262 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0266496Z triton_flex_attention_backward_265 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.0266801Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.3139 seconds precompiling for 13 choices 2025-12-04T09:14:58.0266935Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.0267028Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.0267091Z unimplemented [] 2025-12-04T09:14:58.0267193Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.0267387Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.0268563Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.0268678Z graph_break [] 2025-12-04T09:14:58.0268807Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.0268934Z Autotune Choices Stats: 2025-12-04T09:14:58.0270391Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.0270642Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0270861Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0271179Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0272307Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0273418Z triton_flex_attention_271 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0274532Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0275697Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.0276811Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.0277986Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0278257Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T09:14:58.0278325Z Autotune Choices Stats: 2025-12-04T09:14:58.0279753Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.0280189Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0280509Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0281071Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0282231Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0283418Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0284570Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0285721Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0286967Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0288126Z triton_flex_attention_backward_278 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0289297Z triton_flex_attention_backward_279 0.0136 ms 90.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0290438Z triton_flex_attention_backward_276 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0291583Z triton_flex_attention_backward_281 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0292838Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.0293086Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.2940 seconds precompiling for 13 choices 2025-12-04T09:14:58.0302836Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.0302956Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.0303028Z unimplemented [] 2025-12-04T09:14:58.0303158Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.0303363Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.0304698Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.0304817Z graph_break [] 2025-12-04T09:14:58.0304963Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.0305044Z Autotune Choices Stats: 2025-12-04T09:14:58.0306458Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.0306720Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0306949Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0307284Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0308436Z triton_flex_attention_289 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0309582Z triton_flex_attention_290 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0310738Z triton_flex_attention_287 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.0311877Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0313036Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.0314253Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0314515Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3003 seconds precompiling for 6 choices 2025-12-04T09:14:58.0314585Z Autotune Choices Stats: 2025-12-04T09:14:58.0316024Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_294", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.012160000391304493, "best_triton_pos": 0} 2025-12-04T09:14:58.0316470Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0316800Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0317373Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0318538Z triton_flex_attention_backward_294 0.0122 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0319731Z triton_flex_attention_backward_291 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0320886Z triton_flex_attention_backward_292 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0322090Z triton_flex_attention_backward_293 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0323271Z triton_flex_attention_backward_296 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0324417Z triton_flex_attention_backward_295 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0325565Z triton_flex_attention_backward_297 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0326715Z triton_flex_attention_backward_298 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0327889Z triton_flex_attention_backward_300 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0329053Z triton_flex_attention_backward_303 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.0329306Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.2523 seconds precompiling for 13 choices 2025-12-04T09:14:58.0329450Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.0329566Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.0329630Z unimplemented [] 2025-12-04T09:14:58.0329743Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.0330000Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.0331179Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.0331241Z graph_break [] 2025-12-04T09:14:58.0331373Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.0331452Z Autotune Choices Stats: 2025-12-04T09:14:58.0332923Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_308", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.0333181Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0333403Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0333732Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0334859Z triton_flex_attention_308 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0336008Z triton_flex_attention_309 0.0082 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0337120Z triton_flex_attention_307 0.0112 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.0338233Z triton_flex_attention_304 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0339442Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.0340560Z triton_flex_attention_305 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0340814Z SingleProcess AUTOTUNE benchmarking takes 0.2924 seconds and 1.4096 seconds precompiling for 6 choices 2025-12-04T09:14:58.0340886Z Autotune Choices Stats: 2025-12-04T09:14:58.0342304Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.0342749Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0343080Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0343652Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0344847Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0346016Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0347234Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0348446Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0349612Z triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0350754Z triton_flex_attention_backward_316 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0351911Z triton_flex_attention_backward_317 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0353087Z triton_flex_attention_backward_314 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0354231Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0355379Z triton_flex_attention_backward_318 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0355656Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.3225 seconds precompiling for 13 choices 2025-12-04T09:14:58.0355857Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.0355931Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.0356008Z unimplemented [] 2025-12-04T09:14:58.0356120Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.0356310Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.0357489Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.0357559Z graph_break [] 2025-12-04T09:14:58.0357690Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.0357765Z Autotune Choices Stats: 2025-12-04T09:14:58.0359149Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.007968000136315823, "best_triton_pos": 0} 2025-12-04T09:14:58.0359403Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0359629Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0359956Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0361080Z triton_flex_attention_327 0.0080 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0362478Z triton_flex_attention_328 0.0082 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0363648Z triton_flex_attention_325 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.0364878Z triton_flex_attention_326 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.0366091Z triton_flex_attention_323 0.0113 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0367213Z triton_flex_attention_324 0.0133 ms 59.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0367467Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T09:14:58.0367536Z Autotune Choices Stats: 2025-12-04T09:14:58.0368975Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.0369423Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0369752Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0370354Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0371523Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0372743Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0373997Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0375146Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0376305Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0377454Z triton_flex_attention_backward_336 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0378618Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0379826Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0380981Z triton_flex_attention_backward_338 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0382199Z triton_flex_attention_backward_341 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.0382479Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3668 seconds precompiling for 13 choices 2025-12-04T09:14:58.0382625Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.0382704Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.0382768Z unimplemented [] 2025-12-04T09:14:58.0382884Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.0383077Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.0384256Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.0384319Z graph_break [] 2025-12-04T09:14:58.0384452Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.0384526Z Autotune Choices Stats: 2025-12-04T09:14:58.0385916Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.0386169Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0386388Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0386710Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0387876Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0389006Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0390176Z triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0391326Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.0392439Z triton_flex_attention_345 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.0393560Z triton_flex_attention_343 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0393806Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2820 seconds precompiling for 6 choices 2025-12-04T09:14:58.0393880Z Autotune Choices Stats: 2025-12-04T09:14:58.0395315Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_349", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.0395757Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0396119Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0396684Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0397849Z triton_flex_attention_backward_349 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0399067Z triton_flex_attention_backward_348 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0400253Z triton_flex_attention_backward_350 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0401399Z triton_flex_attention_backward_351 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0402550Z triton_flex_attention_backward_353 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0403699Z triton_flex_attention_backward_355 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0404901Z triton_flex_attention_backward_352 0.0143 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0406050Z triton_flex_attention_backward_354 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0407234Z triton_flex_attention_backward_357 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0408464Z triton_flex_attention_backward_360 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.0408717Z SingleProcess AUTOTUNE benchmarking takes 0.6637 seconds and 2.2959 seconds precompiling for 13 choices 2025-12-04T09:14:58.0408855Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.0408927Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.0408994Z unimplemented [] 2025-12-04T09:14:58.0409114Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.0409306Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.0410486Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.0410550Z graph_break [] 2025-12-04T09:14:58.0410680Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.0410757Z Autotune Choices Stats: 2025-12-04T09:14:58.0412142Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.0412458Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0412732Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0413062Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0414190Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0415343Z triton_flex_attention_366 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0416529Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.0417652Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0418765Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.0419885Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0420129Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3270 seconds precompiling for 6 choices 2025-12-04T09:14:58.0420202Z Autotune Choices Stats: 2025-12-04T09:14:58.0421650Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_368", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.0422090Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0422416Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0422982Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0424205Z triton_flex_attention_backward_368 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0425401Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0426553Z triton_flex_attention_backward_369 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0427708Z triton_flex_attention_backward_370 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0428858Z triton_flex_attention_backward_372 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0430033Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0431191Z triton_flex_attention_backward_374 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0432353Z triton_flex_attention_backward_371 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0433607Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0434761Z triton_flex_attention_backward_379 0.0146 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.0435009Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.4637 seconds precompiling for 13 choices 2025-12-04T09:14:58.0435143Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.0435211Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.0435273Z unimplemented [] 2025-12-04T09:14:58.0435381Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.0435568Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.0436738Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.0436807Z graph_break [] 2025-12-04T09:14:58.0436936Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.0437009Z Autotune Choices Stats: 2025-12-04T09:14:58.0438426Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_384", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.0438679Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0438893Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0439221Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0440354Z triton_flex_attention_384 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0441596Z triton_flex_attention_385 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0442753Z triton_flex_attention_380 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0443889Z triton_flex_attention_382 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.0445005Z triton_flex_attention_383 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.0446126Z triton_flex_attention_381 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0446367Z SingleProcess AUTOTUNE benchmarking takes 0.2918 seconds and 1.3242 seconds precompiling for 6 choices 2025-12-04T09:14:58.0446441Z Autotune Choices Stats: 2025-12-04T09:14:58.0447964Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_386", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.0448406Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0448734Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0449391Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0450622Z triton_flex_attention_backward_386 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0451785Z triton_flex_attention_backward_387 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0453017Z triton_flex_attention_backward_388 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0454202Z triton_flex_attention_backward_389 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0455359Z triton_flex_attention_backward_391 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0456551Z triton_flex_attention_backward_392 0.0134 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0457724Z triton_flex_attention_backward_390 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0458939Z triton_flex_attention_backward_393 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0460120Z triton_flex_attention_backward_395 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0461278Z triton_flex_attention_backward_394 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0461524Z SingleProcess AUTOTUNE benchmarking takes 0.6645 seconds and 2.2524 seconds precompiling for 13 choices 2025-12-04T09:14:58.0461662Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.0461732Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.0461946Z unimplemented [] 2025-12-04T09:14:58.0462071Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.0462260Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.0463448Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.0463511Z graph_break [] 2025-12-04T09:14:58.0463640Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.0463715Z Autotune Choices Stats: 2025-12-04T09:14:58.0465176Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_403", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.0465433Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0465655Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0465982Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0467263Z triton_flex_attention_403 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0468437Z triton_flex_attention_404 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0469552Z triton_flex_attention_399 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0470677Z triton_flex_attention_401 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.0471791Z triton_flex_attention_402 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.0472944Z triton_flex_attention_400 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0473188Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3374 seconds precompiling for 6 choices 2025-12-04T09:14:58.0473262Z Autotune Choices Stats: 2025-12-04T09:14:58.0474687Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_406", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.0475173Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0475542Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0476136Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0477295Z triton_flex_attention_backward_406 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0478456Z triton_flex_attention_backward_405 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0479660Z triton_flex_attention_backward_407 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0480839Z triton_flex_attention_backward_408 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0482028Z triton_flex_attention_backward_410 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0483178Z triton_flex_attention_backward_409 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0484406Z triton_flex_attention_backward_411 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0485585Z triton_flex_attention_backward_412 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0486742Z triton_flex_attention_backward_413 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0487897Z triton_flex_attention_backward_414 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0488140Z SingleProcess AUTOTUNE benchmarking takes 0.6644 seconds and 2.4031 seconds precompiling for 13 choices 2025-12-04T09:14:58.0488277Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.0488346Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.0488409Z unimplemented [] 2025-12-04T09:14:58.0488515Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.0488700Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.0489912Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.0489982Z graph_break [] 2025-12-04T09:14:58.0490119Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.0490196Z Autotune Choices Stats: 2025-12-04T09:14:58.0491580Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_422", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.0491864Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0492081Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0492605Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0493744Z triton_flex_attention_422 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0494898Z triton_flex_attention_423 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0496015Z triton_flex_attention_420 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.0497139Z triton_flex_attention_421 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.0498251Z triton_flex_attention_418 0.0123 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0499408Z triton_flex_attention_419 0.0142 ms 57.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0499652Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3073 seconds precompiling for 6 choices 2025-12-04T09:14:58.0499727Z Autotune Choices Stats: 2025-12-04T09:14:58.0501188Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_426", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012256000190973282, "best_triton_pos": 0} 2025-12-04T09:14:58.0501737Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0502064Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0502632Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0503798Z triton_flex_attention_backward_426 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0504960Z triton_flex_attention_backward_424 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0506106Z triton_flex_attention_backward_425 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0507347Z triton_flex_attention_backward_427 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0508510Z triton_flex_attention_backward_429 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0509700Z triton_flex_attention_backward_431 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0510915Z triton_flex_attention_backward_430 0.0141 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0512062Z triton_flex_attention_backward_428 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0513213Z triton_flex_attention_backward_433 0.0143 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0514368Z triton_flex_attention_backward_436 0.0144 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.0514613Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.3392 seconds precompiling for 13 choices 2025-12-04T09:14:58.0514793Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:14:58.0514876Z Traceback (most recent call last): 2025-12-04T09:14:58.0515177Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:14:58.0515291Z self.assertTrue( 2025-12-04T09:14:58.0515491Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:14:58.0515613Z raise self.failureException(msg) 2025-12-04T09:14:58.0515865Z AssertionError: False is not true : Log file /tmp/tmp2f1po3gr/flex_attention_configs.json was not created 2025-12-04T09:14:58.0515872Z 2025-12-04T09:14:58.0516019Z To execute this test, run the following from the base repo dir: 2025-12-04T09:14:58.0516284Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:14:58.0516288Z 2025-12-04T09:14:58.0516457Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:14:58.0516593Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.0516670Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.0516734Z unimplemented [] 2025-12-04T09:14:58.0516839Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.0518068Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('async_compile_cache_miss', 23), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:14:58.0518320Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.0518385Z graph_break [] 2025-12-04T09:14:58.0518519Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.0519523Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:14:58.0519607Z current_size = base.storage().size() 2025-12-04T09:14:58.0519675Z Autotune Choices Stats: 2025-12-04T09:14:58.0521080Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:58.0521326Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0521559Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0521879Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0523012Z triton_flex_attention_4 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0524152Z triton_flex_attention_5 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0525264Z triton_flex_attention_0 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0526403Z triton_flex_attention_2 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.0527576Z triton_flex_attention_3 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.0528691Z triton_flex_attention_1 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0528939Z SingleProcess AUTOTUNE benchmarking takes 0.2600 seconds and 1.4244 seconds precompiling for 6 choices 2025-12-04T09:14:58.0529013Z Autotune Choices Stats: 2025-12-04T09:14:58.0530447Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:58.0530890Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0531218Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0531786Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0533055Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0534219Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0535419Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0536617Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0537765Z triton_flex_attention_backward_11 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0538922Z triton_flex_attention_backward_12 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0540073Z triton_flex_attention_backward_13 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0541249Z triton_flex_attention_backward_10 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0542403Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0543583Z triton_flex_attention_backward_15 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0543896Z SingleProcess AUTOTUNE benchmarking takes 0.8717 seconds and 2.2315 seconds precompiling for 13 choices 2025-12-04T09:14:58.0544027Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.0544102Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.0544165Z unimplemented [] 2025-12-04T09:14:58.0544268Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.0544454Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.0545630Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.0545699Z graph_break [] 2025-12-04T09:14:58.0545834Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.0545906Z Autotune Choices Stats: 2025-12-04T09:14:58.0547297Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.0547546Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0547762Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0548076Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0549242Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0550355Z triton_flex_attention_24 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0551470Z triton_flex_attention_19 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0552676Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.0553788Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.0554900Z triton_flex_attention_20 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0555143Z SingleProcess AUTOTUNE benchmarking takes 0.2898 seconds and 1.2778 seconds precompiling for 6 choices 2025-12-04T09:14:58.0555213Z Autotune Choices Stats: 2025-12-04T09:14:58.0556638Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.0557073Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0557429Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0557996Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0559163Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0560367Z triton_flex_attention_backward_27 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0561584Z triton_flex_attention_backward_28 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0562912Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0564072Z triton_flex_attention_backward_30 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0565225Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0566442Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0567598Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0568753Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0570000Z triton_flex_attention_backward_37 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.0570299Z SingleProcess AUTOTUNE benchmarking takes 0.6605 seconds and 2.3199 seconds precompiling for 13 choices 2025-12-04T09:14:58.0570430Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.0570504Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.0570580Z unimplemented [] 2025-12-04T09:14:58.0570688Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.0570876Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.0572058Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.0572123Z graph_break [] 2025-12-04T09:14:58.0572252Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.0572386Z Autotune Choices Stats: 2025-12-04T09:14:58.0573802Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.0574048Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0574269Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0574585Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0575757Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0576874Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0578056Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.0579210Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.0580314Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0581430Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0581674Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2616 seconds precompiling for 6 choices 2025-12-04T09:14:58.0581752Z Autotune Choices Stats: 2025-12-04T09:14:58.0583182Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.0583651Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0583976Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0584539Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0585715Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0586963Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0588113Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0589275Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0590425Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0591575Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0592751Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0593895Z triton_flex_attention_backward_50 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0595104Z triton_flex_attention_backward_53 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0596278Z triton_flex_attention_backward_56 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.0596524Z SingleProcess AUTOTUNE benchmarking takes 0.6631 seconds and 2.2635 seconds precompiling for 13 choices 2025-12-04T09:14:58.0596654Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.0596730Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.0596794Z unimplemented [] 2025-12-04T09:14:58.0596898Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.0597087Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.0598263Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.0598329Z graph_break [] 2025-12-04T09:14:58.0598456Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.0598525Z Autotune Choices Stats: 2025-12-04T09:14:58.0599930Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.0600207Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0600430Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0600746Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0601868Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0603027Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0604225Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.0605347Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.0606456Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0607577Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0607822Z SingleProcess AUTOTUNE benchmarking takes 0.2907 seconds and 1.2500 seconds precompiling for 6 choices 2025-12-04T09:14:58.0607892Z Autotune Choices Stats: 2025-12-04T09:14:58.0609347Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.0609782Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0610106Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0610683Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0611911Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0613139Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0614298Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0615458Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0616614Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0617794Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0618943Z triton_flex_attention_backward_69 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0620120Z triton_flex_attention_backward_70 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0621347Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0622496Z triton_flex_attention_backward_75 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.0622747Z SingleProcess AUTOTUNE benchmarking takes 0.6601 seconds and 2.1469 seconds precompiling for 13 choices 2025-12-04T09:14:58.0622875Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.0622948Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.0623010Z unimplemented [] 2025-12-04T09:14:58.0623112Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.0623299Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.0624473Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.0624538Z graph_break [] 2025-12-04T09:14:58.0624666Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.0624733Z Autotune Choices Stats: 2025-12-04T09:14:58.0626152Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.0626401Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0626623Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0626935Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0628068Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0629282Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0630396Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.0631508Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.0632618Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0633737Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0633980Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2576 seconds precompiling for 6 choices 2025-12-04T09:14:58.0634061Z Autotune Choices Stats: 2025-12-04T09:14:58.0635540Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_84", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.0635980Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0636304Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0636967Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0638163Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0639331Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0640475Z triton_flex_attention_backward_83 0.0122 ms 92.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0641630Z triton_flex_attention_backward_82 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0642810Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0643958Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0645105Z triton_flex_attention_backward_88 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0646313Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0647491Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0648641Z triton_flex_attention_backward_91 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0648902Z SingleProcess AUTOTUNE benchmarking takes 0.6599 seconds and 2.2799 seconds precompiling for 13 choices 2025-12-04T09:14:58.0649037Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.0649115Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.0649177Z unimplemented [] 2025-12-04T09:14:58.0649281Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.0649475Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.0650650Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.0650715Z graph_break [] 2025-12-04T09:14:58.0650843Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.0650913Z Autotune Choices Stats: 2025-12-04T09:14:58.0652383Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:58.0652631Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0652855Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0653212Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0654373Z triton_flex_attention_99 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0655520Z triton_flex_attention_100 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0656639Z triton_flex_attention_95 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0657754Z triton_flex_attention_98 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.0658875Z triton_flex_attention_97 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.0660025Z triton_flex_attention_96 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0660273Z SingleProcess AUTOTUNE benchmarking takes 0.2905 seconds and 1.2994 seconds precompiling for 6 choices 2025-12-04T09:14:58.0660341Z Autotune Choices Stats: 2025-12-04T09:14:58.0661765Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_102", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.0662443Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0662899Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0663468Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0664661Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0665829Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0666988Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0668147Z triton_flex_attention_backward_101 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0669350Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0670502Z triton_flex_attention_backward_107 0.0133 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0671714Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0672896Z triton_flex_attention_backward_108 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0674057Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0675208Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0675464Z SingleProcess AUTOTUNE benchmarking takes 0.6615 seconds and 2.2434 seconds precompiling for 13 choices 2025-12-04T09:14:58.0675600Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.0675676Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.0675738Z unimplemented [] 2025-12-04T09:14:58.0675843Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.0676031Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.0677248Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.0677319Z graph_break [] 2025-12-04T09:14:58.0677452Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.0677520Z Autotune Choices Stats: 2025-12-04T09:14:58.0678925Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.0679205Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0679470Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0679820Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0680958Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0682081Z triton_flex_attention_119 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0683206Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.0684320Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.0685470Z triton_flex_attention_114 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0686593Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0686835Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2587 seconds precompiling for 6 choices 2025-12-04T09:14:58.0686900Z Autotune Choices Stats: 2025-12-04T09:14:58.0688379Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.0688889Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0689212Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0689774Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0690944Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0692111Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0693307Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0694516Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0695673Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0696861Z triton_flex_attention_backward_124 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0698091Z triton_flex_attention_backward_126 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0699237Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0700389Z triton_flex_attention_backward_129 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0701539Z triton_flex_attention_backward_132 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.0701791Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.3330 seconds precompiling for 13 choices 2025-12-04T09:14:58.0701919Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.0701996Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.0702057Z unimplemented [] 2025-12-04T09:14:58.0702159Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.0702383Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.0703561Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.0703629Z graph_break [] 2025-12-04T09:14:58.0703757Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.0703823Z Autotune Choices Stats: 2025-12-04T09:14:58.0705493Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:58.0705822Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0706046Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0706361Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0707502Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0708623Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0709741Z triton_flex_attention_136 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.0710855Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.0712006Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0713130Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0713407Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3488 seconds precompiling for 6 choices 2025-12-04T09:14:58.0713472Z Autotune Choices Stats: 2025-12-04T09:14:58.0714928Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_140", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.0715393Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0715730Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0716293Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0717462Z triton_flex_attention_backward_140 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0718634Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0719824Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0720989Z triton_flex_attention_backward_142 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0722136Z triton_flex_attention_backward_144 0.0131 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0723385Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0724530Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0725681Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0726833Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0727976Z triton_flex_attention_backward_149 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0728228Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.2392 seconds precompiling for 13 choices 2025-12-04T09:14:58.0728391Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.0728470Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.0728534Z unimplemented [] 2025-12-04T09:14:58.0728637Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.0728828Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.0730000Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.0730113Z graph_break [] 2025-12-04T09:14:58.0730243Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.0730307Z Autotune Choices Stats: 2025-12-04T09:14:58.0731725Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.0731999Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0732221Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0732572Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0733706Z triton_flex_attention_156 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0734817Z triton_flex_attention_157 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0735936Z triton_flex_attention_152 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0737083Z triton_flex_attention_154 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.0738201Z triton_flex_attention_155 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.0739350Z triton_flex_attention_153 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0739655Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3309 seconds precompiling for 6 choices 2025-12-04T09:14:58.0739721Z Autotune Choices Stats: 2025-12-04T09:14:58.0741149Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_160", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.0741582Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0741909Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0742468Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0743632Z triton_flex_attention_backward_160 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0744780Z triton_flex_attention_backward_158 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0745970Z triton_flex_attention_backward_159 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0747125Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0748334Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0749534Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0750683Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0751835Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0752988Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0754162Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.0754413Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.1546 seconds precompiling for 13 choices 2025-12-04T09:14:58.0754542Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.0754610Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.0754675Z unimplemented [] 2025-12-04T09:14:58.0754777Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.0754961Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.0756165Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.0756292Z graph_break [] 2025-12-04T09:14:58.0756418Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.0756483Z Autotune Choices Stats: 2025-12-04T09:14:58.0757868Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.0758112Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0758336Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0758651Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0759784Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0760938Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0762319Z triton_flex_attention_171 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0763507Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.0764646Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.0765919Z triton_flex_attention_172 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0766166Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3199 seconds precompiling for 6 choices 2025-12-04T09:14:58.0766232Z Autotune Choices Stats: 2025-12-04T09:14:58.0767668Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_178", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.0768102Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0768447Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0769019Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0770190Z triton_flex_attention_backward_178 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0771416Z triton_flex_attention_backward_177 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0772639Z triton_flex_attention_backward_180 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0773834Z triton_flex_attention_backward_179 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0775046Z triton_flex_attention_backward_182 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0776204Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0777389Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0778552Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0779744Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0780894Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.0781143Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3449 seconds precompiling for 13 choices 2025-12-04T09:14:58.0781275Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.0781344Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.0781445Z unimplemented [] 2025-12-04T09:14:58.0781550Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.0781738Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.0782976Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.0783037Z graph_break [] 2025-12-04T09:14:58.0783171Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.0783242Z Autotune Choices Stats: 2025-12-04T09:14:58.0784640Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:58.0784885Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0785115Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0785436Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0786577Z triton_flex_attention_194 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0787724Z triton_flex_attention_195 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0788890Z triton_flex_attention_192 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.0790003Z triton_flex_attention_193 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.0791180Z triton_flex_attention_190 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0792345Z triton_flex_attention_191 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0792595Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3094 seconds precompiling for 6 choices 2025-12-04T09:14:58.0792662Z Autotune Choices Stats: 2025-12-04T09:14:58.0794086Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.0794520Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0794849Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0795421Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0796622Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0797775Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0798958Z triton_flex_attention_backward_197 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0800174Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0801327Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0802515Z triton_flex_attention_backward_200 0.0124 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0803662Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0804814Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0805999Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.0807147Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0807426Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.3930 seconds precompiling for 13 choices 2025-12-04T09:14:58.0807554Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.0807692Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.0807760Z unimplemented [] 2025-12-04T09:14:58.0807862Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.0808050Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.0809261Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.0809324Z graph_break [] 2025-12-04T09:14:58.0809453Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.0809521Z Autotune Choices Stats: 2025-12-04T09:14:58.0810909Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008063999935984612, "best_triton_pos": 0} 2025-12-04T09:14:58.0811151Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0811371Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0811686Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0812854Z triton_flex_attention_213 0.0081 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0814005Z triton_flex_attention_214 0.0082 ms 98.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0815128Z triton_flex_attention_211 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.0816309Z triton_flex_attention_212 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.0817454Z triton_flex_attention_209 0.0112 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0818602Z triton_flex_attention_210 0.0133 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0818848Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.3148 seconds precompiling for 6 choices 2025-12-04T09:14:58.0818915Z Autotune Choices Stats: 2025-12-04T09:14:58.0820355Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.0820788Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0821117Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0821713Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0822877Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0824032Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0825266Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0826456Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0827608Z triton_flex_attention_backward_219 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0828768Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0829916Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0831136Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0832291Z triton_flex_attention_backward_224 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0833467Z triton_flex_attention_backward_227 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.0833777Z SingleProcess AUTOTUNE benchmarking takes 0.6619 seconds and 2.3746 seconds precompiling for 13 choices 2025-12-04T09:14:58.0833904Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.0833972Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.0834043Z unimplemented [] 2025-12-04T09:14:58.0834145Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.0834332Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.0835506Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.0835568Z graph_break [] 2025-12-04T09:14:58.0835700Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.0835766Z Autotune Choices Stats: 2025-12-04T09:14:58.0837158Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.0837401Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0837621Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0837934Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0839107Z triton_flex_attention_232 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0840223Z triton_flex_attention_233 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0841369Z triton_flex_attention_228 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0842559Z triton_flex_attention_230 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.0843679Z triton_flex_attention_231 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.0844795Z triton_flex_attention_229 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0845039Z SingleProcess AUTOTUNE benchmarking takes 0.2900 seconds and 1.2906 seconds precompiling for 6 choices 2025-12-04T09:14:58.0845106Z Autotune Choices Stats: 2025-12-04T09:14:58.0846541Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.0846976Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0847339Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0847901Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0849070Z triton_flex_attention_backward_234 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0850283Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0851468Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0852696Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0853853Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0855006Z triton_flex_attention_backward_240 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0856186Z triton_flex_attention_backward_241 0.0134 ms 91.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0857348Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0858498Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0859769Z triton_flex_attention_backward_246 0.0153 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.0860017Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2768 seconds precompiling for 13 choices 2025-12-04T09:14:58.0860146Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.0860216Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.0860282Z unimplemented [] 2025-12-04T09:14:58.0860381Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.0860566Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.0861743Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.0861965Z graph_break [] 2025-12-04T09:14:58.0862106Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.0862178Z Autotune Choices Stats: 2025-12-04T09:14:58.0863585Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.0863827Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0864047Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0864433Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0865574Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0866691Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0867945Z triton_flex_attention_250 0.0111 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.0869059Z triton_flex_attention_247 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0870182Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.0871309Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0871560Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2952 seconds precompiling for 6 choices 2025-12-04T09:14:58.0871627Z Autotune Choices Stats: 2025-12-04T09:14:58.0873090Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_255", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.0873523Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0873855Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0874412Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0875617Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0876830Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0877987Z triton_flex_attention_backward_254 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0879140Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0880290Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0881452Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0882631Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0883786Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0884993Z triton_flex_attention_backward_262 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0886171Z triton_flex_attention_backward_265 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.0886425Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.3139 seconds precompiling for 13 choices 2025-12-04T09:14:58.0886561Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.0886632Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.0886702Z unimplemented [] 2025-12-04T09:14:58.0886809Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.0886994Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.0888174Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.0888238Z graph_break [] 2025-12-04T09:14:58.0888380Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.0888448Z Autotune Choices Stats: 2025-12-04T09:14:58.0889903Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.0890149Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0890377Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0890691Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0891813Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0893067Z triton_flex_attention_271 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0894216Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0895331Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.0896446Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.0897559Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0897811Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T09:14:58.0897876Z Autotune Choices Stats: 2025-12-04T09:14:58.0899345Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.0899783Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0900112Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0900706Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0901979Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0903130Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0904283Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0905445Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0906605Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0907791Z triton_flex_attention_backward_278 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0908937Z triton_flex_attention_backward_279 0.0136 ms 90.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0910157Z triton_flex_attention_backward_276 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0911340Z triton_flex_attention_backward_281 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0912490Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.0912740Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.2940 seconds precompiling for 13 choices 2025-12-04T09:14:58.0912870Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.0912940Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.0913010Z unimplemented [] 2025-12-04T09:14:58.0913111Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.0913300Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.0914478Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.0914539Z graph_break [] 2025-12-04T09:14:58.0914670Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.0914736Z Autotune Choices Stats: 2025-12-04T09:14:58.0916160Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.0916404Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0916627Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0916941Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0918102Z triton_flex_attention_289 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0919296Z triton_flex_attention_290 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0920431Z triton_flex_attention_287 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.0921540Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0922663Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.0923776Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0924058Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3003 seconds precompiling for 6 choices 2025-12-04T09:14:58.0924125Z Autotune Choices Stats: 2025-12-04T09:14:58.0925555Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_294", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.012160000391304493, "best_triton_pos": 0} 2025-12-04T09:14:58.0925987Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0926379Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0926972Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0928141Z triton_flex_attention_backward_294 0.0122 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0929301Z triton_flex_attention_backward_291 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0930454Z triton_flex_attention_backward_292 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0931609Z triton_flex_attention_backward_293 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0932869Z triton_flex_attention_backward_296 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0934030Z triton_flex_attention_backward_295 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0935215Z triton_flex_attention_backward_297 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0936431Z triton_flex_attention_backward_298 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0937578Z triton_flex_attention_backward_300 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0938729Z triton_flex_attention_backward_303 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.0938981Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.2523 seconds precompiling for 13 choices 2025-12-04T09:14:58.0939112Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.0939181Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.0939252Z unimplemented [] 2025-12-04T09:14:58.0939356Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.0939538Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.0940724Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.0940786Z graph_break [] 2025-12-04T09:14:58.0940950Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.0941019Z Autotune Choices Stats: 2025-12-04T09:14:58.0942403Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_308", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.0942648Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0942908Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0943256Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0944411Z triton_flex_attention_308 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0945539Z triton_flex_attention_309 0.0082 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0946664Z triton_flex_attention_307 0.0112 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.0947772Z triton_flex_attention_304 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0948901Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.0950048Z triton_flex_attention_305 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0950297Z SingleProcess AUTOTUNE benchmarking takes 0.2924 seconds and 1.4096 seconds precompiling for 6 choices 2025-12-04T09:14:58.0950365Z Autotune Choices Stats: 2025-12-04T09:14:58.0951794Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.0952339Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0952673Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0953233Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0954406Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0955559Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0956717Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0957906Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0959060Z triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0960217Z triton_flex_attention_backward_316 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0961427Z triton_flex_attention_backward_317 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0962805Z triton_flex_attention_backward_314 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0963967Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0965122Z triton_flex_attention_backward_318 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0965375Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.3225 seconds precompiling for 13 choices 2025-12-04T09:14:58.0965504Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.0965575Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.0965645Z unimplemented [] 2025-12-04T09:14:58.0965748Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.0965931Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.0967177Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.0967245Z graph_break [] 2025-12-04T09:14:58.0967379Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.0967447Z Autotune Choices Stats: 2025-12-04T09:14:58.0968846Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.007968000136315823, "best_triton_pos": 0} 2025-12-04T09:14:58.0969236Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0969460Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0969785Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0970908Z triton_flex_attention_327 0.0080 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0972025Z triton_flex_attention_328 0.0082 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0973186Z triton_flex_attention_325 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.0974299Z triton_flex_attention_326 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.0975451Z triton_flex_attention_323 0.0113 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0976565Z triton_flex_attention_324 0.0133 ms 59.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0976813Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T09:14:58.0976880Z Autotune Choices Stats: 2025-12-04T09:14:58.0978384Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.0978858Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0979189Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0979755Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0980923Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0982076Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0983246Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0984430Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0985580Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0986806Z triton_flex_attention_backward_336 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.0987986Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.0989134Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.0990282Z triton_flex_attention_backward_338 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0991435Z triton_flex_attention_backward_341 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.0991685Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3668 seconds precompiling for 13 choices 2025-12-04T09:14:58.0991813Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.0991883Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.0991950Z unimplemented [] 2025-12-04T09:14:58.0992083Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.0992268Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.0993450Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.0993510Z graph_break [] 2025-12-04T09:14:58.0993641Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.0993708Z Autotune Choices Stats: 2025-12-04T09:14:58.0995159Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.0995433Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.0995649Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.0995974Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.0997105Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0998216Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.0999338Z triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1000490Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.1001610Z triton_flex_attention_345 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.1002721Z triton_flex_attention_343 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1002999Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2820 seconds precompiling for 6 choices 2025-12-04T09:14:58.1003134Z Autotune Choices Stats: 2025-12-04T09:14:58.1004564Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_349", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.1004998Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1005330Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1005889Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1007056Z triton_flex_attention_backward_349 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1008217Z triton_flex_attention_backward_348 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1009416Z triton_flex_attention_backward_350 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1010564Z triton_flex_attention_backward_351 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1011749Z triton_flex_attention_backward_353 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1013007Z triton_flex_attention_backward_355 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1014160Z triton_flex_attention_backward_352 0.0143 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1015317Z triton_flex_attention_backward_354 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1016470Z triton_flex_attention_backward_357 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1017672Z triton_flex_attention_backward_360 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.1017925Z SingleProcess AUTOTUNE benchmarking takes 0.6637 seconds and 2.2959 seconds precompiling for 13 choices 2025-12-04T09:14:58.1018068Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.1018139Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.1018212Z unimplemented [] 2025-12-04T09:14:58.1018324Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.1018511Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.1019692Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.1019789Z graph_break [] 2025-12-04T09:14:58.1019924Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.1020057Z Autotune Choices Stats: 2025-12-04T09:14:58.1021446Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.1021701Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1021919Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1022246Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1023385Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1024504Z triton_flex_attention_366 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1025626Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.1026765Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1027883Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.1029052Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1029336Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3270 seconds precompiling for 6 choices 2025-12-04T09:14:58.1029403Z Autotune Choices Stats: 2025-12-04T09:14:58.1030845Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_368", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.1031283Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1031613Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1032174Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1033341Z triton_flex_attention_backward_368 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1034528Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1035685Z triton_flex_attention_backward_369 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1036830Z triton_flex_attention_backward_370 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1038118Z triton_flex_attention_backward_372 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1039280Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1040425Z triton_flex_attention_backward_374 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1041573Z triton_flex_attention_backward_371 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1042720Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1043904Z triton_flex_attention_backward_379 0.0146 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.1044152Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.4637 seconds precompiling for 13 choices 2025-12-04T09:14:58.1044285Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.1044365Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.1044437Z unimplemented [] 2025-12-04T09:14:58.1044540Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.1044723Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.1045960Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.1046069Z graph_break [] 2025-12-04T09:14:58.1046200Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.1046267Z Autotune Choices Stats: 2025-12-04T09:14:58.1047651Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_384", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.1047900Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1048117Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1048439Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1049568Z triton_flex_attention_384 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1050690Z triton_flex_attention_385 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1051846Z triton_flex_attention_380 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1053010Z triton_flex_attention_382 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.1054163Z triton_flex_attention_383 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.1055334Z triton_flex_attention_381 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1055584Z SingleProcess AUTOTUNE benchmarking takes 0.2918 seconds and 1.3242 seconds precompiling for 6 choices 2025-12-04T09:14:58.1055655Z Autotune Choices Stats: 2025-12-04T09:14:58.1057086Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_386", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.1057521Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1057853Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1058416Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1059591Z triton_flex_attention_backward_386 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1060776Z triton_flex_attention_backward_387 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1062088Z triton_flex_attention_backward_388 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1063363Z triton_flex_attention_backward_389 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1064572Z triton_flex_attention_backward_391 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1065735Z triton_flex_attention_backward_392 0.0134 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1066888Z triton_flex_attention_backward_390 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1068045Z triton_flex_attention_backward_393 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1069247Z triton_flex_attention_backward_395 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1070413Z triton_flex_attention_backward_394 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1070659Z SingleProcess AUTOTUNE benchmarking takes 0.6645 seconds and 2.2524 seconds precompiling for 13 choices 2025-12-04T09:14:58.1070828Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.1070897Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.1070966Z unimplemented [] 2025-12-04T09:14:58.1071068Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.1071319Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.1072501Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.1072562Z graph_break [] 2025-12-04T09:14:58.1072698Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.1072766Z Autotune Choices Stats: 2025-12-04T09:14:58.1074149Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_403", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.1074399Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1074616Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1074951Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1076083Z triton_flex_attention_403 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1077232Z triton_flex_attention_404 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1078343Z triton_flex_attention_399 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1079471Z triton_flex_attention_401 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.1080694Z triton_flex_attention_402 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.1081805Z triton_flex_attention_400 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1082054Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3374 seconds precompiling for 6 choices 2025-12-04T09:14:58.1082124Z Autotune Choices Stats: 2025-12-04T09:14:58.1083558Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_406", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.1083991Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1084335Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1084900Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1086100Z triton_flex_attention_backward_406 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1087258Z triton_flex_attention_backward_405 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1088471Z triton_flex_attention_backward_407 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1089647Z triton_flex_attention_backward_408 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1090811Z triton_flex_attention_backward_410 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1091962Z triton_flex_attention_backward_409 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1093155Z triton_flex_attention_backward_411 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1094343Z triton_flex_attention_backward_412 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1095496Z triton_flex_attention_backward_413 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1096647Z triton_flex_attention_backward_414 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1096986Z SingleProcess AUTOTUNE benchmarking takes 0.6644 seconds and 2.4031 seconds precompiling for 13 choices 2025-12-04T09:14:58.1097121Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.1097189Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.1097251Z unimplemented [] 2025-12-04T09:14:58.1097358Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.1097553Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.1098731Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.1098795Z graph_break [] 2025-12-04T09:14:58.1098925Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.1098992Z Autotune Choices Stats: 2025-12-04T09:14:58.1100377Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_422", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.1100628Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1100850Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1101167Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1102323Z triton_flex_attention_422 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1103439Z triton_flex_attention_423 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1104549Z triton_flex_attention_420 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.1105735Z triton_flex_attention_421 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.1106901Z triton_flex_attention_418 0.0123 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1108016Z triton_flex_attention_419 0.0142 ms 57.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1108264Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3073 seconds precompiling for 6 choices 2025-12-04T09:14:58.1108332Z Autotune Choices Stats: 2025-12-04T09:14:58.1109768Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_426", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012256000190973282, "best_triton_pos": 0} 2025-12-04T09:14:58.1110207Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1110535Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1111131Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1112307Z triton_flex_attention_backward_426 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1113504Z triton_flex_attention_backward_424 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1114719Z triton_flex_attention_backward_425 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1115869Z triton_flex_attention_backward_427 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1117040Z triton_flex_attention_backward_429 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1118190Z triton_flex_attention_backward_431 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1119355Z triton_flex_attention_backward_430 0.0141 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1120540Z triton_flex_attention_backward_428 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1121691Z triton_flex_attention_backward_433 0.0143 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1122907Z triton_flex_attention_backward_436 0.0144 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.1123185Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.3392 seconds precompiling for 13 choices 2025-12-04T09:14:58.1123319Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.1123390Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.1123454Z unimplemented [] 2025-12-04T09:14:58.1123566Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.1123752Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.1124927Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.1124989Z graph_break [] 2025-12-04T09:14:58.1125122Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.1125190Z Autotune Choices Stats: 2025-12-04T09:14:58.1126577Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_441", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.1126830Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1127049Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1127370Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1128533Z triton_flex_attention_441 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1129664Z triton_flex_attention_442 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1130837Z triton_flex_attention_439 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.1131984Z triton_flex_attention_437 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1133168Z triton_flex_attention_440 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.1134281Z triton_flex_attention_438 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1134531Z SingleProcess AUTOTUNE benchmarking takes 0.2906 seconds and 1.4377 seconds precompiling for 6 choices 2025-12-04T09:14:58.1134601Z Autotune Choices Stats: 2025-12-04T09:14:58.1136017Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_445", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.1136493Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1136828Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1137394Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1138554Z triton_flex_attention_backward_445 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1139779Z triton_flex_attention_backward_446 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1140966Z triton_flex_attention_backward_443 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1142117Z triton_flex_attention_backward_444 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1143269Z triton_flex_attention_backward_447 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1144421Z triton_flex_attention_backward_448 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1145620Z triton_flex_attention_backward_450 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1146784Z triton_flex_attention_backward_449 0.0135 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1148255Z triton_flex_attention_backward_452 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1149485Z triton_flex_attention_backward_455 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.1149732Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.3278 seconds precompiling for 13 choices 2025-12-04T09:14:58.1149915Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:14:58.1149995Z Traceback (most recent call last): 2025-12-04T09:14:58.1150296Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:14:58.1150369Z self.assertTrue( 2025-12-04T09:14:58.1150570Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:14:58.1150658Z raise self.failureException(msg) 2025-12-04T09:14:58.1150898Z AssertionError: False is not true : Log file /tmp/tmp0r9jrtv3/flex_attention_configs.json was not created 2025-12-04T09:14:58.1150903Z 2025-12-04T09:14:58.1151036Z To execute this test, run the following from the base repo dir: 2025-12-04T09:14:58.1151296Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:14:58.1151303Z 2025-12-04T09:14:58.1151465Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:14:58.1151604Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.1151685Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.1151747Z unimplemented [] 2025-12-04T09:14:58.1151859Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.1153042Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('async_compile_cache_miss', 23), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:14:58.1153269Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.1153332Z graph_break [] 2025-12-04T09:14:58.1153458Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.1154454Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:14:58.1154534Z current_size = base.storage().size() 2025-12-04T09:14:58.1154607Z Autotune Choices Stats: 2025-12-04T09:14:58.1156034Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:58.1156350Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1156571Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1156890Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1158033Z triton_flex_attention_4 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1159144Z triton_flex_attention_5 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1160255Z triton_flex_attention_0 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1161371Z triton_flex_attention_2 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.1162705Z triton_flex_attention_3 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.1163851Z triton_flex_attention_1 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1164097Z SingleProcess AUTOTUNE benchmarking takes 0.2600 seconds and 1.4244 seconds precompiling for 6 choices 2025-12-04T09:14:58.1164218Z Autotune Choices Stats: 2025-12-04T09:14:58.1165689Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:58.1166188Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1166517Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1167083Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1168242Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1169410Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1170592Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1171742Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1172933Z triton_flex_attention_backward_11 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1174183Z triton_flex_attention_backward_12 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1175381Z triton_flex_attention_backward_13 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1176530Z triton_flex_attention_backward_10 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1177681Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1178831Z triton_flex_attention_backward_15 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1179075Z SingleProcess AUTOTUNE benchmarking takes 0.8717 seconds and 2.2315 seconds precompiling for 13 choices 2025-12-04T09:14:58.1179240Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.1179318Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.1179382Z unimplemented [] 2025-12-04T09:14:58.1179487Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.1179682Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.1180856Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.1180923Z graph_break [] 2025-12-04T09:14:58.1181083Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.1181155Z Autotune Choices Stats: 2025-12-04T09:14:58.1182570Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.1182851Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1183072Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1183388Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1184518Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1185623Z triton_flex_attention_24 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1186746Z triton_flex_attention_19 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1187894Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.1189009Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.1190130Z triton_flex_attention_20 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1190462Z SingleProcess AUTOTUNE benchmarking takes 0.2898 seconds and 1.2778 seconds precompiling for 6 choices 2025-12-04T09:14:58.1190543Z Autotune Choices Stats: 2025-12-04T09:14:58.1191980Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.1192420Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1192745Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1193306Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1194465Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1195619Z triton_flex_attention_backward_27 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1196799Z triton_flex_attention_backward_28 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1197949Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1199159Z triton_flex_attention_backward_30 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1200338Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1201485Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1202630Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1203778Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1204952Z triton_flex_attention_backward_37 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.1205203Z SingleProcess AUTOTUNE benchmarking takes 0.6605 seconds and 2.3199 seconds precompiling for 13 choices 2025-12-04T09:14:58.1205333Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.1205408Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.1205472Z unimplemented [] 2025-12-04T09:14:58.1205574Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.1205765Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.1206939Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.1207115Z graph_break [] 2025-12-04T09:14:58.1207244Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.1207317Z Autotune Choices Stats: 2025-12-04T09:14:58.1208697Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.1208951Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1209174Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1209483Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1210611Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1211733Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1212929Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.1214056Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.1215167Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1216344Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1216621Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2616 seconds precompiling for 6 choices 2025-12-04T09:14:58.1216695Z Autotune Choices Stats: 2025-12-04T09:14:58.1218114Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.1218553Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1218885Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1219452Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1220611Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1221801Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1222948Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1224152Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1225359Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1226504Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1227652Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1228806Z triton_flex_attention_backward_50 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1229962Z triton_flex_attention_backward_53 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1231140Z triton_flex_attention_backward_56 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.1231392Z SingleProcess AUTOTUNE benchmarking takes 0.6631 seconds and 2.2635 seconds precompiling for 13 choices 2025-12-04T09:14:58.1231521Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.1231597Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.1231659Z unimplemented [] 2025-12-04T09:14:58.1231764Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.1231989Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.1233196Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.1233304Z graph_break [] 2025-12-04T09:14:58.1233435Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.1233503Z Autotune Choices Stats: 2025-12-04T09:14:58.1234896Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.1235142Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1235362Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1235676Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1236808Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1237917Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1239074Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.1240190Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.1241374Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1242522Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1242767Z SingleProcess AUTOTUNE benchmarking takes 0.2907 seconds and 1.2500 seconds precompiling for 6 choices 2025-12-04T09:14:58.1242838Z Autotune Choices Stats: 2025-12-04T09:14:58.1244263Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.1244701Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1245026Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1245594Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1246791Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1247950Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1249112Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1250362Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1251510Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1259662Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1261024Z triton_flex_attention_backward_69 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1262374Z triton_flex_attention_backward_70 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1263643Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1264845Z triton_flex_attention_backward_75 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.1265183Z SingleProcess AUTOTUNE benchmarking takes 0.6601 seconds and 2.1469 seconds precompiling for 13 choices 2025-12-04T09:14:58.1265331Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.1265414Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.1265530Z unimplemented [] 2025-12-04T09:14:58.1265697Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.1265902Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.1267128Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.1267199Z graph_break [] 2025-12-04T09:14:58.1267337Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.1267406Z Autotune Choices Stats: 2025-12-04T09:14:58.1268817Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.1269074Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1269326Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1269656Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1270807Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1271958Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1273109Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.1274250Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.1275457Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1276574Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1276829Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2576 seconds precompiling for 6 choices 2025-12-04T09:14:58.1276904Z Autotune Choices Stats: 2025-12-04T09:14:58.1278333Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_84", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.1278783Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1279110Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1279671Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1280879Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1282047Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1283255Z triton_flex_attention_backward_83 0.0122 ms 92.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1284448Z triton_flex_attention_backward_82 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1285586Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1286772Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1287918Z triton_flex_attention_backward_88 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1289100Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1290253Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1291397Z triton_flex_attention_backward_91 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1291757Z SingleProcess AUTOTUNE benchmarking takes 0.6599 seconds and 2.2799 seconds precompiling for 13 choices 2025-12-04T09:14:58.1291911Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.1291987Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.1292094Z unimplemented [] 2025-12-04T09:14:58.1292214Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.1292463Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.1293676Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.1293745Z graph_break [] 2025-12-04T09:14:58.1293887Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.1293955Z Autotune Choices Stats: 2025-12-04T09:14:58.1295368Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:58.1295619Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1295846Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1296163Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1297357Z triton_flex_attention_99 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1298483Z triton_flex_attention_100 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1299618Z triton_flex_attention_95 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1300831Z triton_flex_attention_98 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.1301955Z triton_flex_attention_97 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.1303072Z triton_flex_attention_96 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1303327Z SingleProcess AUTOTUNE benchmarking takes 0.2905 seconds and 1.2994 seconds precompiling for 6 choices 2025-12-04T09:14:58.1303394Z Autotune Choices Stats: 2025-12-04T09:14:58.1304846Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_102", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.1305289Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1305662Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1306226Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1307424Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1308617Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1309844Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1311010Z triton_flex_attention_backward_101 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1312165Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1313321Z triton_flex_attention_backward_107 0.0133 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1314509Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1315669Z triton_flex_attention_backward_108 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1316831Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1318063Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1318348Z SingleProcess AUTOTUNE benchmarking takes 0.6615 seconds and 2.2434 seconds precompiling for 13 choices 2025-12-04T09:14:58.1318484Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.1318555Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.1318627Z unimplemented [] 2025-12-04T09:14:58.1318732Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.1318926Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.1320117Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.1320179Z graph_break [] 2025-12-04T09:14:58.1320326Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.1320395Z Autotune Choices Stats: 2025-12-04T09:14:58.1321808Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.1322056Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1322282Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1322601Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1323774Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1324903Z triton_flex_attention_119 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1326091Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.1327236Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.1328355Z triton_flex_attention_114 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1329471Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1329726Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2587 seconds precompiling for 6 choices 2025-12-04T09:14:58.1329798Z Autotune Choices Stats: 2025-12-04T09:14:58.1331236Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.1331709Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1332046Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1332664Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1333846Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1335103Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1336270Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1337430Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1338585Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1339754Z triton_flex_attention_backward_124 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1340949Z triton_flex_attention_backward_126 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1342149Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1343395Z triton_flex_attention_backward_129 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1344586Z triton_flex_attention_backward_132 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.1344849Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.3330 seconds precompiling for 13 choices 2025-12-04T09:14:58.1344989Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.1345062Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.1345132Z unimplemented [] 2025-12-04T09:14:58.1345242Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.1345430Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.1346626Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.1346688Z graph_break [] 2025-12-04T09:14:58.1346837Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.1346910Z Autotune Choices Stats: 2025-12-04T09:14:58.1348330Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:58.1348621Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1348861Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1349187Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1350333Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1351549Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1352707Z triton_flex_attention_136 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.1353835Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.1354966Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1356092Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1356352Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3488 seconds precompiling for 6 choices 2025-12-04T09:14:58.1356420Z Autotune Choices Stats: 2025-12-04T09:14:58.1357908Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_140", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.1358352Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1358691Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1359263Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1360509Z triton_flex_attention_backward_140 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1361707Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1363091Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1364272Z triton_flex_attention_backward_142 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1365447Z triton_flex_attention_backward_144 0.0131 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1366679Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1367852Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1369074Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1370325Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1371496Z triton_flex_attention_backward_149 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1371755Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.2392 seconds precompiling for 13 choices 2025-12-04T09:14:58.1371889Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.1371960Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.1372043Z unimplemented [] 2025-12-04T09:14:58.1372149Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.1372405Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.1373608Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.1373673Z graph_break [] 2025-12-04T09:14:58.1373810Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.1373877Z Autotune Choices Stats: 2025-12-04T09:14:58.1375336Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.1375587Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1375812Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1376132Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1377311Z triton_flex_attention_156 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1378503Z triton_flex_attention_157 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1379641Z triton_flex_attention_152 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1380769Z triton_flex_attention_154 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.1381900Z triton_flex_attention_155 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.1383028Z triton_flex_attention_153 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1383278Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3309 seconds precompiling for 6 choices 2025-12-04T09:14:58.1383410Z Autotune Choices Stats: 2025-12-04T09:14:58.1384865Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_160", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.1385302Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1385638Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1386296Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1387485Z triton_flex_attention_backward_160 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1388653Z triton_flex_attention_backward_158 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1389825Z triton_flex_attention_backward_159 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1390988Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1392231Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1393403Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1394563Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1395825Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1396983Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1398153Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.1398405Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.1546 seconds precompiling for 13 choices 2025-12-04T09:14:58.1398539Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.1398608Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.1398675Z unimplemented [] 2025-12-04T09:14:58.1398788Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.1398986Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.1400178Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.1400239Z graph_break [] 2025-12-04T09:14:58.1400373Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.1400474Z Autotune Choices Stats: 2025-12-04T09:14:58.1401892Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.1402136Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1402352Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1402712Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1403879Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1405039Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1406167Z triton_flex_attention_171 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1407292Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.1408426Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.1409591Z triton_flex_attention_172 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1409862Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3199 seconds precompiling for 6 choices 2025-12-04T09:14:58.1409932Z Autotune Choices Stats: 2025-12-04T09:14:58.1411389Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_178", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.1411889Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1412268Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1412888Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1414074Z triton_flex_attention_backward_178 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1415240Z triton_flex_attention_backward_177 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1416410Z triton_flex_attention_backward_180 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1417573Z triton_flex_attention_backward_179 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1418774Z triton_flex_attention_backward_182 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1419939Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1421158Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1422353Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1423511Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1424679Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.1424925Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3449 seconds precompiling for 13 choices 2025-12-04T09:14:58.1425062Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.1425131Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.1425201Z unimplemented [] 2025-12-04T09:14:58.1425307Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.1425489Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.1426712Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.1426778Z graph_break [] 2025-12-04T09:14:58.1426926Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.1426994Z Autotune Choices Stats: 2025-12-04T09:14:58.1428405Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:58.1428684Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1428965Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1429289Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1430426Z triton_flex_attention_194 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1431559Z triton_flex_attention_195 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1432697Z triton_flex_attention_192 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.1433829Z triton_flex_attention_193 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.1434999Z triton_flex_attention_190 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1436121Z triton_flex_attention_191 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1436375Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3094 seconds precompiling for 6 choices 2025-12-04T09:14:58.1436442Z Autotune Choices Stats: 2025-12-04T09:14:58.1437920Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.1438414Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1438745Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1439314Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1440505Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1441679Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1442849Z triton_flex_attention_backward_197 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1444040Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1445209Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1446404Z triton_flex_attention_backward_200 0.0124 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1447642Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1448817Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1449979Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.1451146Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1451396Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.3930 seconds precompiling for 13 choices 2025-12-04T09:14:58.1451531Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.1451600Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.1451672Z unimplemented [] 2025-12-04T09:14:58.1451775Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.1451994Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.1453239Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.1453305Z graph_break [] 2025-12-04T09:14:58.1453441Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.1453509Z Autotune Choices Stats: 2025-12-04T09:14:58.1454951Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008063999935984612, "best_triton_pos": 0} 2025-12-04T09:14:58.1455267Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1455483Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1455809Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1456955Z triton_flex_attention_213 0.0081 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1458082Z triton_flex_attention_214 0.0082 ms 98.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1459220Z triton_flex_attention_211 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.1460351Z triton_flex_attention_212 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.1461516Z triton_flex_attention_209 0.0112 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1462806Z triton_flex_attention_210 0.0133 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1463127Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.3148 seconds precompiling for 6 choices 2025-12-04T09:14:58.1463197Z Autotune Choices Stats: 2025-12-04T09:14:58.1464730Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.1465169Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1465504Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1466070Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1467248Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1468408Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1469624Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1470793Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1471959Z triton_flex_attention_backward_219 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1473226Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1474387Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1475552Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1476715Z triton_flex_attention_backward_224 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1477905Z triton_flex_attention_backward_227 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.1478187Z SingleProcess AUTOTUNE benchmarking takes 0.6619 seconds and 2.3746 seconds precompiling for 13 choices 2025-12-04T09:14:58.1478325Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.1478393Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.1478463Z unimplemented [] 2025-12-04T09:14:58.1478567Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.1478752Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.1479951Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.1480063Z graph_break [] 2025-12-04T09:14:58.1480201Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.1480268Z Autotune Choices Stats: 2025-12-04T09:14:58.1481734Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.1481999Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1482221Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1482546Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1483690Z triton_flex_attention_232 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1484827Z triton_flex_attention_233 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1485958Z triton_flex_attention_228 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1487116Z triton_flex_attention_230 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.1488251Z triton_flex_attention_231 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.1489413Z triton_flex_attention_229 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1489726Z SingleProcess AUTOTUNE benchmarking takes 0.2900 seconds and 1.2906 seconds precompiling for 6 choices 2025-12-04T09:14:58.1489792Z Autotune Choices Stats: 2025-12-04T09:14:58.1491238Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.1491676Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1492010Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1492640Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1493838Z triton_flex_attention_backward_234 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1495042Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1496213Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1497374Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1498613Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1499818Z triton_flex_attention_backward_240 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1500982Z triton_flex_attention_backward_241 0.0134 ms 91.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1502149Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1503319Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1504517Z triton_flex_attention_backward_246 0.0153 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.1504766Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2768 seconds precompiling for 13 choices 2025-12-04T09:14:58.1504902Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.1504969Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.1505035Z unimplemented [] 2025-12-04T09:14:58.1505137Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.1505322Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.1506543Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.1506675Z graph_break [] 2025-12-04T09:14:58.1506812Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.1506880Z Autotune Choices Stats: 2025-12-04T09:14:58.1508281Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.1508538Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1508764Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1509087Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1510231Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1511366Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1512546Z triton_flex_attention_250 0.0111 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.1513676Z triton_flex_attention_247 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1514805Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.1516022Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1516272Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2952 seconds precompiling for 6 choices 2025-12-04T09:14:58.1516336Z Autotune Choices Stats: 2025-12-04T09:14:58.1517785Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_255", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.1518231Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1518563Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1519133Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1520320Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1521515Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1522684Z triton_flex_attention_backward_254 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1523909Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1525147Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1526313Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1527478Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1528641Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1529833Z triton_flex_attention_backward_262 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1531000Z triton_flex_attention_backward_265 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.1531246Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.3139 seconds precompiling for 13 choices 2025-12-04T09:14:58.1531379Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.1531479Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.1531545Z unimplemented [] 2025-12-04T09:14:58.1531648Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.1531830Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.1533135Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.1533201Z graph_break [] 2025-12-04T09:14:58.1533340Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.1533408Z Autotune Choices Stats: 2025-12-04T09:14:58.1534825Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.1535076Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1535295Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1535619Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1536766Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1537941Z triton_flex_attention_271 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1539076Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1540209Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.1541403Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.1542582Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1542836Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T09:14:58.1542903Z Autotune Choices Stats: 2025-12-04T09:14:58.1544348Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.1544787Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1545135Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1545701Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1546916Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1548087Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1549298Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1550521Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1551687Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1552856Z triton_flex_attention_backward_278 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1554023Z triton_flex_attention_backward_279 0.0136 ms 90.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1555194Z triton_flex_attention_backward_276 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1556400Z triton_flex_attention_backward_281 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1557568Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.1557844Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.2940 seconds precompiling for 13 choices 2025-12-04T09:14:58.1558045Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.1558115Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.1558179Z unimplemented [] 2025-12-04T09:14:58.1558288Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.1558474Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.1559675Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.1559736Z graph_break [] 2025-12-04T09:14:58.1559872Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.1559940Z Autotune Choices Stats: 2025-12-04T09:14:58.1561339Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.1561595Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1561949Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1562281Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1563425Z triton_flex_attention_289 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1564622Z triton_flex_attention_290 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1565767Z triton_flex_attention_287 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.1566986Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1568166Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.1569310Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1569561Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3003 seconds precompiling for 6 choices 2025-12-04T09:14:58.1569628Z Autotune Choices Stats: 2025-12-04T09:14:58.1571084Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_294", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.012160000391304493, "best_triton_pos": 0} 2025-12-04T09:14:58.1571523Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1571856Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1572496Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1573706Z triton_flex_attention_backward_294 0.0122 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1574873Z triton_flex_attention_backward_291 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1576149Z triton_flex_attention_backward_292 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1577306Z triton_flex_attention_backward_293 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1578471Z triton_flex_attention_backward_296 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1579643Z triton_flex_attention_backward_295 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1580819Z triton_flex_attention_backward_297 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1582017Z triton_flex_attention_backward_298 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1583178Z triton_flex_attention_backward_300 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1584407Z triton_flex_attention_backward_303 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.1584685Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.2523 seconds precompiling for 13 choices 2025-12-04T09:14:58.1584822Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.1584892Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.1584955Z unimplemented [] 2025-12-04T09:14:58.1585063Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.1585246Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.1586442Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.1586504Z graph_break [] 2025-12-04T09:14:58.1586638Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.1586705Z Autotune Choices Stats: 2025-12-04T09:14:58.1588119Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_308", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.1588372Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1588592Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1588919Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1590108Z triton_flex_attention_308 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1591247Z triton_flex_attention_309 0.0082 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1592405Z triton_flex_attention_307 0.0112 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.1593596Z triton_flex_attention_304 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1594732Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.1595863Z triton_flex_attention_305 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1596117Z SingleProcess AUTOTUNE benchmarking takes 0.2924 seconds and 1.4096 seconds precompiling for 6 choices 2025-12-04T09:14:58.1596187Z Autotune Choices Stats: 2025-12-04T09:14:58.1597635Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.1598073Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1598443Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1599013Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1600199Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1601748Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1602952Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1604125Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1605299Z triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1606463Z triton_flex_attention_backward_316 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1607666Z triton_flex_attention_backward_317 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1608844Z triton_flex_attention_backward_314 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1610055Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1611284Z triton_flex_attention_backward_318 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1611533Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.3225 seconds precompiling for 13 choices 2025-12-04T09:14:58.1611674Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.1611743Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.1611807Z unimplemented [] 2025-12-04T09:14:58.1611918Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.1612103Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.1613351Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.1613412Z graph_break [] 2025-12-04T09:14:58.1613544Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.1613618Z Autotune Choices Stats: 2025-12-04T09:14:58.1615017Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.007968000136315823, "best_triton_pos": 0} 2025-12-04T09:14:58.1615270Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1615528Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1615854Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1616992Z triton_flex_attention_327 0.0080 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1618137Z triton_flex_attention_328 0.0082 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1619369Z triton_flex_attention_325 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.1620506Z triton_flex_attention_326 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.1621644Z triton_flex_attention_323 0.0113 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1622770Z triton_flex_attention_324 0.0133 ms 59.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1623022Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T09:14:58.1623088Z Autotune Choices Stats: 2025-12-04T09:14:58.1624564Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.1625012Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1625341Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1625909Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1627115Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1628351Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1629528Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1630693Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1631861Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1633052Z triton_flex_attention_backward_336 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1634222Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1635403Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1636645Z triton_flex_attention_backward_338 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1637852Z triton_flex_attention_backward_341 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.1638106Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3668 seconds precompiling for 13 choices 2025-12-04T09:14:58.1638244Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.1638313Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.1638375Z unimplemented [] 2025-12-04T09:14:58.1638482Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.1638665Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.1639868Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.1639939Z graph_break [] 2025-12-04T09:14:58.1640068Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.1640142Z Autotune Choices Stats: 2025-12-04T09:14:58.1641606Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.1641859Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1642082Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1642404Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1643544Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1644737Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1645891Z triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1647018Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.1648143Z triton_flex_attention_345 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.1649273Z triton_flex_attention_343 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1649522Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2820 seconds precompiling for 6 choices 2025-12-04T09:14:58.1649588Z Autotune Choices Stats: 2025-12-04T09:14:58.1651062Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_349", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.1651508Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1651834Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1652482Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1653740Z triton_flex_attention_backward_349 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1654911Z triton_flex_attention_backward_348 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1656080Z triton_flex_attention_backward_350 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1657239Z triton_flex_attention_backward_351 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1658407Z triton_flex_attention_backward_353 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1659611Z triton_flex_attention_backward_355 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1660775Z triton_flex_attention_backward_352 0.0143 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1662229Z triton_flex_attention_backward_354 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1663483Z triton_flex_attention_backward_357 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1664664Z triton_flex_attention_backward_360 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.1664913Z SingleProcess AUTOTUNE benchmarking takes 0.6637 seconds and 2.2959 seconds precompiling for 13 choices 2025-12-04T09:14:58.1665050Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.1665121Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.1665187Z unimplemented [] 2025-12-04T09:14:58.1665294Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.1665480Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.1666675Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.1666738Z graph_break [] 2025-12-04T09:14:58.1666866Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.1666938Z Autotune Choices Stats: 2025-12-04T09:14:58.1668403Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.1668663Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1668889Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1669216Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1670433Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1671629Z triton_flex_attention_366 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1672760Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.1673887Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1675013Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.1676180Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1676433Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3270 seconds precompiling for 6 choices 2025-12-04T09:14:58.1676500Z Autotune Choices Stats: 2025-12-04T09:14:58.1677944Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_368", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.1678387Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1678779Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1679380Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1680552Z triton_flex_attention_backward_368 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1681718Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1682886Z triton_flex_attention_backward_369 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1684045Z triton_flex_attention_backward_370 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1685244Z triton_flex_attention_backward_372 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1686405Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1687601Z triton_flex_attention_backward_374 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1688826Z triton_flex_attention_backward_371 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1689982Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1691152Z triton_flex_attention_backward_379 0.0146 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.1691409Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.4637 seconds precompiling for 13 choices 2025-12-04T09:14:58.1691547Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.1691616Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.1691681Z unimplemented [] 2025-12-04T09:14:58.1691789Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.1691974Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.1693228Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.1693331Z graph_break [] 2025-12-04T09:14:58.1693464Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.1693544Z Autotune Choices Stats: 2025-12-04T09:14:58.1694957Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_384", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.1695207Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1695463Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1695855Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1696991Z triton_flex_attention_384 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1698120Z triton_flex_attention_385 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1699247Z triton_flex_attention_380 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1700394Z triton_flex_attention_382 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.1701522Z triton_flex_attention_383 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.1702693Z triton_flex_attention_381 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1702951Z SingleProcess AUTOTUNE benchmarking takes 0.2918 seconds and 1.3242 seconds precompiling for 6 choices 2025-12-04T09:14:58.1703020Z Autotune Choices Stats: 2025-12-04T09:14:58.1704459Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_386", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.1705043Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1705378Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1705953Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1707129Z triton_flex_attention_backward_386 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1708308Z triton_flex_attention_backward_387 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1709482Z triton_flex_attention_backward_388 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1710679Z triton_flex_attention_backward_389 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1711855Z triton_flex_attention_backward_391 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1713010Z triton_flex_attention_backward_392 0.0134 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1714265Z triton_flex_attention_backward_390 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1715432Z triton_flex_attention_backward_393 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1716595Z triton_flex_attention_backward_395 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1717766Z triton_flex_attention_backward_394 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1718015Z SingleProcess AUTOTUNE benchmarking takes 0.6645 seconds and 2.2524 seconds precompiling for 13 choices 2025-12-04T09:14:58.1718152Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.1718228Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.1718291Z unimplemented [] 2025-12-04T09:14:58.1718401Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.1718584Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.1719809Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.1719874Z graph_break [] 2025-12-04T09:14:58.1720001Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.1720074Z Autotune Choices Stats: 2025-12-04T09:14:58.1721470Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_403", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.1721827Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1722048Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1722372Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1723507Z triton_flex_attention_403 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1724642Z triton_flex_attention_404 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1725762Z triton_flex_attention_399 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1726902Z triton_flex_attention_401 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.1728064Z triton_flex_attention_402 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.1729204Z triton_flex_attention_400 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1729449Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3374 seconds precompiling for 6 choices 2025-12-04T09:14:58.1729555Z Autotune Choices Stats: 2025-12-04T09:14:58.1731020Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_406", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.1731496Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1731826Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1732452Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1733649Z triton_flex_attention_backward_406 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1734820Z triton_flex_attention_backward_405 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1735988Z triton_flex_attention_backward_407 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1737187Z triton_flex_attention_backward_408 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1738356Z triton_flex_attention_backward_410 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1739601Z triton_flex_attention_backward_409 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1740793Z triton_flex_attention_backward_411 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1741962Z triton_flex_attention_backward_412 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1743119Z triton_flex_attention_backward_413 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1744283Z triton_flex_attention_backward_414 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1744527Z SingleProcess AUTOTUNE benchmarking takes 0.6644 seconds and 2.4031 seconds precompiling for 13 choices 2025-12-04T09:14:58.1744666Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.1744737Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.1744831Z unimplemented [] 2025-12-04T09:14:58.1744940Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.1745121Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.1746309Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.1746367Z graph_break [] 2025-12-04T09:14:58.1746495Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.1746602Z Autotune Choices Stats: 2025-12-04T09:14:58.1748034Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_422", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.1748320Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1748542Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1748869Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1750002Z triton_flex_attention_422 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1751131Z triton_flex_attention_423 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1752259Z triton_flex_attention_420 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.1753421Z triton_flex_attention_421 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.1754545Z triton_flex_attention_418 0.0123 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1755685Z triton_flex_attention_419 0.0142 ms 57.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1756022Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3073 seconds precompiling for 6 choices 2025-12-04T09:14:58.1756094Z Autotune Choices Stats: 2025-12-04T09:14:58.1757531Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_426", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012256000190973282, "best_triton_pos": 0} 2025-12-04T09:14:58.1757985Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1758321Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1758889Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1760063Z triton_flex_attention_backward_426 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1761234Z triton_flex_attention_backward_424 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1762641Z triton_flex_attention_backward_425 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1763819Z triton_flex_attention_backward_427 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1765034Z triton_flex_attention_backward_429 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1766286Z triton_flex_attention_backward_431 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1767460Z triton_flex_attention_backward_430 0.0141 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1768621Z triton_flex_attention_backward_428 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1769801Z triton_flex_attention_backward_433 0.0143 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1771023Z triton_flex_attention_backward_436 0.0144 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.1771271Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.3392 seconds precompiling for 13 choices 2025-12-04T09:14:58.1771407Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.1771475Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.1771538Z unimplemented [] 2025-12-04T09:14:58.1771648Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.1771831Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.1773108Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.1773213Z graph_break [] 2025-12-04T09:14:58.1773406Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.1773479Z Autotune Choices Stats: 2025-12-04T09:14:58.1774880Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_441", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.1775131Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1775349Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1775672Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1776805Z triton_flex_attention_441 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1777936Z triton_flex_attention_442 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1779061Z triton_flex_attention_439 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.1780227Z triton_flex_attention_437 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1781351Z triton_flex_attention_440 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.1782561Z triton_flex_attention_438 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1782837Z SingleProcess AUTOTUNE benchmarking takes 0.2906 seconds and 1.4377 seconds precompiling for 6 choices 2025-12-04T09:14:58.1782910Z Autotune Choices Stats: 2025-12-04T09:14:58.1784352Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_445", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.1784797Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1785124Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1785695Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1786876Z triton_flex_attention_backward_445 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1788092Z triton_flex_attention_backward_446 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1789266Z triton_flex_attention_backward_443 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1790464Z triton_flex_attention_backward_444 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1791687Z triton_flex_attention_backward_447 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1792851Z triton_flex_attention_backward_448 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1794018Z triton_flex_attention_backward_450 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1795181Z triton_flex_attention_backward_449 0.0135 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1796347Z triton_flex_attention_backward_452 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1797549Z triton_flex_attention_backward_455 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.1797797Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.3278 seconds precompiling for 13 choices 2025-12-04T09:14:58.1797933Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.1798002Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.1798065Z unimplemented [] 2025-12-04T09:14:58.1798170Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.1798353Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.1799654Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.1799769Z graph_break [] 2025-12-04T09:14:58.1799896Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.1799969Z Autotune Choices Stats: 2025-12-04T09:14:58.1803279Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_460", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:58.1803589Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1803824Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1804148Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1805324Z triton_flex_attention_460 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1806499Z triton_flex_attention_461 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1807654Z triton_flex_attention_459 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.1808794Z triton_flex_attention_456 0.0114 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1809997Z triton_flex_attention_458 0.0123 ms 75.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.1811192Z triton_flex_attention_457 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1811454Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.4513 seconds precompiling for 6 choices 2025-12-04T09:14:58.1811521Z Autotune Choices Stats: 2025-12-04T09:14:58.1813111Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_463", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.1813557Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1813902Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1814479Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1815664Z triton_flex_attention_backward_463 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1816826Z triton_flex_attention_backward_465 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1817994Z triton_flex_attention_backward_462 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1819217Z triton_flex_attention_backward_464 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1820410Z triton_flex_attention_backward_467 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1821620Z triton_flex_attention_backward_468 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1822775Z triton_flex_attention_backward_469 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1823942Z triton_flex_attention_backward_466 0.0133 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1825100Z triton_flex_attention_backward_471 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1826262Z triton_flex_attention_backward_470 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1826512Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.2662 seconds precompiling for 13 choices 2025-12-04T09:14:58.1826733Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:14:58.1826810Z Traceback (most recent call last): 2025-12-04T09:14:58.1827154Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:14:58.1827253Z self.assertTrue( 2025-12-04T09:14:58.1827453Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:14:58.1827544Z raise self.failureException(msg) 2025-12-04T09:14:58.1827787Z AssertionError: False is not true : Log file /tmp/tmpwwy855w8/flex_attention_configs.json was not created 2025-12-04T09:14:58.1827793Z 2025-12-04T09:14:58.1827931Z To execute this test, run the following from the base repo dir: 2025-12-04T09:14:58.1828197Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:14:58.1828203Z 2025-12-04T09:14:58.1828368Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:14:58.1828511Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.1828595Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.1828664Z unimplemented [] 2025-12-04T09:14:58.1828788Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.1830030Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('async_compile_cache_miss', 23), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:14:58.1830227Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.1830289Z graph_break [] 2025-12-04T09:14:58.1830421Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.1831435Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:14:58.1831517Z current_size = base.storage().size() 2025-12-04T09:14:58.1831589Z Autotune Choices Stats: 2025-12-04T09:14:58.1832997Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:58.1833253Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1833475Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1833802Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1834934Z triton_flex_attention_4 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1836170Z triton_flex_attention_5 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1837285Z triton_flex_attention_0 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1838442Z triton_flex_attention_2 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.1839556Z triton_flex_attention_3 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.1840679Z triton_flex_attention_1 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1840927Z SingleProcess AUTOTUNE benchmarking takes 0.2600 seconds and 1.4244 seconds precompiling for 6 choices 2025-12-04T09:14:58.1841000Z Autotune Choices Stats: 2025-12-04T09:14:58.1842436Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:58.1842876Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1843205Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1843836Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1845030Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1846236Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1847390Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1848547Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1849705Z triton_flex_attention_backward_11 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1850856Z triton_flex_attention_backward_12 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1852028Z triton_flex_attention_backward_13 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1853292Z triton_flex_attention_backward_10 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1854487Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1855690Z triton_flex_attention_backward_15 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1855935Z SingleProcess AUTOTUNE benchmarking takes 0.8717 seconds and 2.2315 seconds precompiling for 13 choices 2025-12-04T09:14:58.1856066Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.1856146Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.1856208Z unimplemented [] 2025-12-04T09:14:58.1856316Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.1856505Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.1857689Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.1857754Z graph_break [] 2025-12-04T09:14:58.1857880Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.1857957Z Autotune Choices Stats: 2025-12-04T09:14:58.1859357Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.1859604Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1859821Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1860174Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1861351Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1862673Z triton_flex_attention_24 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1863877Z triton_flex_attention_19 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1865017Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.1866148Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.1867275Z triton_flex_attention_20 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1867529Z SingleProcess AUTOTUNE benchmarking takes 0.2898 seconds and 1.2778 seconds precompiling for 6 choices 2025-12-04T09:14:58.1867600Z Autotune Choices Stats: 2025-12-04T09:14:58.1869036Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.1869548Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1869976Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1870547Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1871722Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1872934Z triton_flex_attention_backward_27 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1874107Z triton_flex_attention_backward_28 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1875276Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1876439Z triton_flex_attention_backward_30 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1877614Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1878853Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1880040Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1881238Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1882399Z triton_flex_attention_backward_37 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.1882647Z SingleProcess AUTOTUNE benchmarking takes 0.6605 seconds and 2.3199 seconds precompiling for 13 choices 2025-12-04T09:14:58.1882784Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.1882867Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.1882930Z unimplemented [] 2025-12-04T09:14:58.1883040Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.1883233Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.1884421Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.1884491Z graph_break [] 2025-12-04T09:14:58.1884628Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.1884701Z Autotune Choices Stats: 2025-12-04T09:14:58.1886108Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.1886400Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1886656Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1887012Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1888164Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1889339Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1890469Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.1891598Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.1892766Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1893897Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1894143Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2616 seconds precompiling for 6 choices 2025-12-04T09:14:58.1894214Z Autotune Choices Stats: 2025-12-04T09:14:58.1895688Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.1896197Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1896522Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1897089Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1898315Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1899483Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1900645Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1901805Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1902969Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1904153Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1905377Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1906570Z triton_flex_attention_backward_50 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1907729Z triton_flex_attention_backward_53 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1908904Z triton_flex_attention_backward_56 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.1909151Z SingleProcess AUTOTUNE benchmarking takes 0.6631 seconds and 2.2635 seconds precompiling for 13 choices 2025-12-04T09:14:58.1909280Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.1909356Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.1909416Z unimplemented [] 2025-12-04T09:14:58.1909515Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.1909706Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.1910893Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.1910958Z graph_break [] 2025-12-04T09:14:58.1911083Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.1911155Z Autotune Choices Stats: 2025-12-04T09:14:58.1912585Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.1912893Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1913111Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1913427Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1914560Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1915714Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1916847Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.1917976Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.1919107Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1920238Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1920481Z SingleProcess AUTOTUNE benchmarking takes 0.2907 seconds and 1.2500 seconds precompiling for 6 choices 2025-12-04T09:14:58.1920582Z Autotune Choices Stats: 2025-12-04T09:14:58.1922046Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.1922511Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1922837Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1923441Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1924606Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1925773Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1926931Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1928148Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1929312Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1930542Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1931752Z triton_flex_attention_backward_69 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1933055Z triton_flex_attention_backward_70 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1934243Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1935408Z triton_flex_attention_backward_75 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.1935652Z SingleProcess AUTOTUNE benchmarking takes 0.6601 seconds and 2.1469 seconds precompiling for 13 choices 2025-12-04T09:14:58.1935789Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.1935858Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.1935923Z unimplemented [] 2025-12-04T09:14:58.1936042Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.1936232Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.1937411Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.1937470Z graph_break [] 2025-12-04T09:14:58.1937634Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.1937705Z Autotune Choices Stats: 2025-12-04T09:14:58.1939124Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.1939403Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1939620Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1939940Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1941097Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1942216Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1943327Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.1944438Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.1945550Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1946655Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1947001Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2576 seconds precompiling for 6 choices 2025-12-04T09:14:58.1947068Z Autotune Choices Stats: 2025-12-04T09:14:58.1948484Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_84", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.1948925Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1949293Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1949858Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1951014Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1952170Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1953329Z triton_flex_attention_backward_83 0.0122 ms 92.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1954472Z triton_flex_attention_backward_82 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1955684Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1956863Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1958044Z triton_flex_attention_backward_88 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1959197Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1960345Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1961491Z triton_flex_attention_backward_91 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1961736Z SingleProcess AUTOTUNE benchmarking takes 0.6599 seconds and 2.2799 seconds precompiling for 13 choices 2025-12-04T09:14:58.1962055Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.1962130Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.1962194Z unimplemented [] 2025-12-04T09:14:58.1962306Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.1962492Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.1963663Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.1963899Z graph_break [] 2025-12-04T09:14:58.1964031Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.1964103Z Autotune Choices Stats: 2025-12-04T09:14:58.1965503Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:58.1965755Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1966021Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1966345Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1967467Z triton_flex_attention_99 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1968580Z triton_flex_attention_100 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1969696Z triton_flex_attention_95 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1970818Z triton_flex_attention_98 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.1971924Z triton_flex_attention_97 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.1973192Z triton_flex_attention_96 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1973474Z SingleProcess AUTOTUNE benchmarking takes 0.2905 seconds and 1.2994 seconds precompiling for 6 choices 2025-12-04T09:14:58.1973541Z Autotune Choices Stats: 2025-12-04T09:14:58.1974998Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_102", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.1975440Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1975763Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1976330Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1977494Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1978772Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1979943Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1981123Z triton_flex_attention_backward_101 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1982331Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1983510Z triton_flex_attention_backward_107 0.0133 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.1984661Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1985821Z triton_flex_attention_backward_108 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.1986967Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.1988130Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1988379Z SingleProcess AUTOTUNE benchmarking takes 0.6615 seconds and 2.2434 seconds precompiling for 13 choices 2025-12-04T09:14:58.1988508Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.1988577Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.1988649Z unimplemented [] 2025-12-04T09:14:58.1988754Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.1988982Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.1990197Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.1990291Z graph_break [] 2025-12-04T09:14:58.1990424Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.1990492Z Autotune Choices Stats: 2025-12-04T09:14:58.1991919Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.1992166Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.1992390Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.1992716Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.1993845Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1994957Z triton_flex_attention_119 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1996084Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.1997200Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.1998396Z triton_flex_attention_114 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1999542Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.1999794Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2587 seconds precompiling for 6 choices 2025-12-04T09:14:58.1999858Z Autotune Choices Stats: 2025-12-04T09:14:58.2001313Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.2001743Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2002073Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2002633Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2003797Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2004960Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2006121Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2007368Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2008515Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2009717Z triton_flex_attention_backward_124 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2010866Z triton_flex_attention_backward_126 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2012014Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2013223Z triton_flex_attention_backward_129 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2014369Z triton_flex_attention_backward_132 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.2014652Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.3330 seconds precompiling for 13 choices 2025-12-04T09:14:58.2014780Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.2014882Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.2014982Z unimplemented [] 2025-12-04T09:14:58.2015084Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.2015268Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.2016444Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.2016504Z graph_break [] 2025-12-04T09:14:58.2016652Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.2016720Z Autotune Choices Stats: 2025-12-04T09:14:58.2018140Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:58.2018381Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2018603Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2018917Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2020050Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2021163Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2022290Z triton_flex_attention_136 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.2023427Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.2024601Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2025760Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2026006Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3488 seconds precompiling for 6 choices 2025-12-04T09:14:58.2026071Z Autotune Choices Stats: 2025-12-04T09:14:58.2027494Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_140", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.2027926Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2028251Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2028813Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2029977Z triton_flex_attention_backward_140 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2031123Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2032335Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2033516Z triton_flex_attention_backward_142 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2034698Z triton_flex_attention_backward_144 0.0131 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2035855Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2037003Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2038155Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2039309Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2040506Z triton_flex_attention_backward_149 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2040827Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.2392 seconds precompiling for 13 choices 2025-12-04T09:14:58.2040959Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.2041029Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.2041100Z unimplemented [] 2025-12-04T09:14:58.2041205Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.2041389Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.2042567Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.2042664Z graph_break [] 2025-12-04T09:14:58.2042802Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.2042868Z Autotune Choices Stats: 2025-12-04T09:14:58.2044263Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.2044509Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2044736Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2045055Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2046174Z triton_flex_attention_156 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2047289Z triton_flex_attention_157 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2048404Z triton_flex_attention_152 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2049817Z triton_flex_attention_154 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.2050938Z triton_flex_attention_155 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.2052088Z triton_flex_attention_153 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2052454Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3309 seconds precompiling for 6 choices 2025-12-04T09:14:58.2052524Z Autotune Choices Stats: 2025-12-04T09:14:58.2053958Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_160", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.2054390Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2054719Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2055282Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2056452Z triton_flex_attention_backward_160 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2057636Z triton_flex_attention_backward_158 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2058875Z triton_flex_attention_backward_159 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2060056Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2061210Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2062595Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2063753Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2064904Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2066045Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2067304Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.2067641Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.1546 seconds precompiling for 13 choices 2025-12-04T09:14:58.2067771Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.2067842Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.2067908Z unimplemented [] 2025-12-04T09:14:58.2068010Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.2068192Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.2069422Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.2069484Z graph_break [] 2025-12-04T09:14:58.2069615Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.2069682Z Autotune Choices Stats: 2025-12-04T09:14:58.2071074Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.2071316Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2071542Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2071865Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2072988Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2074105Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2075290Z triton_flex_attention_171 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2076427Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.2077573Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.2078681Z triton_flex_attention_172 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2078940Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3199 seconds precompiling for 6 choices 2025-12-04T09:14:58.2079008Z Autotune Choices Stats: 2025-12-04T09:14:58.2080429Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_178", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.2080856Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2081185Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2081744Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2082910Z triton_flex_attention_backward_178 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2084151Z triton_flex_attention_backward_177 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2085303Z triton_flex_attention_backward_180 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2086490Z triton_flex_attention_backward_179 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2087635Z triton_flex_attention_backward_182 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2088787Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2089934Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2091087Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2092411Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2093638Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.2093889Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3449 seconds precompiling for 13 choices 2025-12-04T09:14:58.2094030Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.2094100Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.2094202Z unimplemented [] 2025-12-04T09:14:58.2094305Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.2094492Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.2095674Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.2095735Z graph_break [] 2025-12-04T09:14:58.2095871Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.2095938Z Autotune Choices Stats: 2025-12-04T09:14:58.2097332Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:58.2097578Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2097799Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2098124Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2099266Z triton_flex_attention_194 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2100448Z triton_flex_attention_195 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2101594Z triton_flex_attention_192 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.2102734Z triton_flex_attention_193 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.2103847Z triton_flex_attention_190 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2104963Z triton_flex_attention_191 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2105215Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3094 seconds precompiling for 6 choices 2025-12-04T09:14:58.2105284Z Autotune Choices Stats: 2025-12-04T09:14:58.2106725Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.2107168Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2107503Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2108060Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2109303Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2110483Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2111670Z triton_flex_attention_backward_197 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2112817Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2113983Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2115136Z triton_flex_attention_backward_200 0.0124 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2116285Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2117467Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2118676Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.2119859Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2120103Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.3930 seconds precompiling for 13 choices 2025-12-04T09:14:58.2120240Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.2120312Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.2120380Z unimplemented [] 2025-12-04T09:14:58.2120480Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.2120664Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.2121844Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.2121904Z graph_break [] 2025-12-04T09:14:58.2122036Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.2122101Z Autotune Choices Stats: 2025-12-04T09:14:58.2123489Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008063999935984612, "best_triton_pos": 0} 2025-12-04T09:14:58.2123742Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2123960Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2124293Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2125485Z triton_flex_attention_213 0.0081 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2126661Z triton_flex_attention_214 0.0082 ms 98.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2127774Z triton_flex_attention_211 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.2128929Z triton_flex_attention_212 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.2130048Z triton_flex_attention_209 0.0112 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2131158Z triton_flex_attention_210 0.0133 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2131403Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.3148 seconds precompiling for 6 choices 2025-12-04T09:14:58.2131471Z Autotune Choices Stats: 2025-12-04T09:14:58.2132957Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.2133389Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2133755Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2134374Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2135542Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2136745Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2137905Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2139054Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2140209Z triton_flex_attention_backward_219 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2141370Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2142517Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2143759Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2144905Z triton_flex_attention_backward_224 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2146093Z triton_flex_attention_backward_227 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.2146335Z SingleProcess AUTOTUNE benchmarking takes 0.6619 seconds and 2.3746 seconds precompiling for 13 choices 2025-12-04T09:14:58.2146469Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.2146536Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.2146603Z unimplemented [] 2025-12-04T09:14:58.2146704Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.2146888Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.2148064Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.2148122Z graph_break [] 2025-12-04T09:14:58.2148254Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.2148321Z Autotune Choices Stats: 2025-12-04T09:14:58.2149703Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.2149949Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2150165Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2150526Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2151708Z triton_flex_attention_232 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2152824Z triton_flex_attention_233 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2153983Z triton_flex_attention_228 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2155099Z triton_flex_attention_230 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.2156231Z triton_flex_attention_231 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.2157335Z triton_flex_attention_229 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2157587Z SingleProcess AUTOTUNE benchmarking takes 0.2900 seconds and 1.2906 seconds precompiling for 6 choices 2025-12-04T09:14:58.2157653Z Autotune Choices Stats: 2025-12-04T09:14:58.2159081Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.2159582Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2159944Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2160501Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2161699Z triton_flex_attention_backward_234 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2163482Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2164652Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2165800Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2166946Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2168089Z triton_flex_attention_backward_240 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2169369Z triton_flex_attention_backward_241 0.0134 ms 91.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2170564Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2171769Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2173035Z triton_flex_attention_backward_246 0.0153 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.2173291Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2768 seconds precompiling for 13 choices 2025-12-04T09:14:58.2173435Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.2173508Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.2173581Z unimplemented [] 2025-12-04T09:14:58.2173699Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.2173888Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.2175070Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.2175136Z graph_break [] 2025-12-04T09:14:58.2175276Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.2175345Z Autotune Choices Stats: 2025-12-04T09:14:58.2176730Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.2177026Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2177313Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2177637Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2178761Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2179904Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2181014Z triton_flex_attention_250 0.0111 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.2182125Z triton_flex_attention_247 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2183228Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.2184330Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2184578Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2952 seconds precompiling for 6 choices 2025-12-04T09:14:58.2184644Z Autotune Choices Stats: 2025-12-04T09:14:58.2186111Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_255", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.2186626Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2186954Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2187515Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2188721Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2189865Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2191021Z triton_flex_attention_backward_254 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2192167Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2193312Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2194526Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2195692Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2196873Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2198017Z triton_flex_attention_backward_262 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2199174Z triton_flex_attention_backward_265 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.2199418Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.3139 seconds precompiling for 13 choices 2025-12-04T09:14:58.2199555Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.2199623Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.2199684Z unimplemented [] 2025-12-04T09:14:58.2199794Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.2199976Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.2201159Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.2201219Z graph_break [] 2025-12-04T09:14:58.2201350Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.2201428Z Autotune Choices Stats: 2025-12-04T09:14:58.2202847Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.2203205Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2203420Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2203740Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2204897Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2206014Z triton_flex_attention_271 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2207121Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2208233Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.2209356Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.2210459Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2210738Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T09:14:58.2210850Z Autotune Choices Stats: 2025-12-04T09:14:58.2212377Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.2212833Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2213206Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2213765Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2214926Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2216076Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2217234Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2218380Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2219566Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2220787Z triton_flex_attention_backward_278 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2221928Z triton_flex_attention_backward_279 0.0136 ms 90.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2223113Z triton_flex_attention_backward_276 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2224257Z triton_flex_attention_backward_281 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2225403Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.2225648Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.2940 seconds precompiling for 13 choices 2025-12-04T09:14:58.2225787Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.2225858Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.2225933Z unimplemented [] 2025-12-04T09:14:58.2226046Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.2226234Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.2227427Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.2227525Z graph_break [] 2025-12-04T09:14:58.2233588Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.2233788Z Autotune Choices Stats: 2025-12-04T09:14:58.2235293Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.2235564Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2235796Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2236128Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2237308Z triton_flex_attention_289 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2238431Z triton_flex_attention_290 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2239552Z triton_flex_attention_287 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.2240687Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2241825Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.2243012Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2243308Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3003 seconds precompiling for 6 choices 2025-12-04T09:14:58.2243383Z Autotune Choices Stats: 2025-12-04T09:14:58.2244830Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_294", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.012160000391304493, "best_triton_pos": 0} 2025-12-04T09:14:58.2245312Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2245645Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2246214Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2247399Z triton_flex_attention_backward_294 0.0122 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2248566Z triton_flex_attention_backward_291 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2249738Z triton_flex_attention_backward_292 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2250894Z triton_flex_attention_backward_293 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2252118Z triton_flex_attention_backward_296 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2253398Z triton_flex_attention_backward_295 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2254589Z triton_flex_attention_backward_297 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2255739Z triton_flex_attention_backward_298 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2256887Z triton_flex_attention_backward_300 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2258037Z triton_flex_attention_backward_303 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.2258295Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.2523 seconds precompiling for 13 choices 2025-12-04T09:14:58.2258440Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.2258515Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.2258579Z unimplemented [] 2025-12-04T09:14:58.2258694Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.2258882Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.2260133Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.2260268Z graph_break [] 2025-12-04T09:14:58.2260401Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.2260476Z Autotune Choices Stats: 2025-12-04T09:14:58.2262114Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_308", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.2262484Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2262727Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2263059Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2264197Z triton_flex_attention_308 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2265322Z triton_flex_attention_309 0.0082 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2266434Z triton_flex_attention_307 0.0112 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.2267548Z triton_flex_attention_304 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2268719Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.2269954Z triton_flex_attention_305 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2270207Z SingleProcess AUTOTUNE benchmarking takes 0.2924 seconds and 1.4096 seconds precompiling for 6 choices 2025-12-04T09:14:58.2270278Z Autotune Choices Stats: 2025-12-04T09:14:58.2271747Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.2272191Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2272518Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2273087Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2274256Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2275420Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2276580Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2277795Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2278984Z triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2280165Z triton_flex_attention_backward_316 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2281320Z triton_flex_attention_backward_317 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2282475Z triton_flex_attention_backward_314 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2283630Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2284793Z triton_flex_attention_backward_318 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2285042Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.3225 seconds precompiling for 13 choices 2025-12-04T09:14:58.2285182Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.2285289Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.2285352Z unimplemented [] 2025-12-04T09:14:58.2285466Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.2285688Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.2286907Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.2286969Z graph_break [] 2025-12-04T09:14:58.2287099Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.2287176Z Autotune Choices Stats: 2025-12-04T09:14:58.2288623Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.007968000136315823, "best_triton_pos": 0} 2025-12-04T09:14:58.2288879Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2289098Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2289421Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2290552Z triton_flex_attention_327 0.0080 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2291668Z triton_flex_attention_328 0.0082 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2292874Z triton_flex_attention_325 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.2293999Z triton_flex_attention_326 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.2295190Z triton_flex_attention_323 0.0113 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2296340Z triton_flex_attention_324 0.0133 ms 59.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2296596Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T09:14:58.2296664Z Autotune Choices Stats: 2025-12-04T09:14:58.2298120Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.2298563Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2298897Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2299478Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2300647Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2301811Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2302994Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2304209Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2305396Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2306544Z triton_flex_attention_backward_336 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2307700Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2308865Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2310016Z triton_flex_attention_backward_338 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2311172Z triton_flex_attention_backward_341 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.2311452Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3668 seconds precompiling for 13 choices 2025-12-04T09:14:58.2311656Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.2311733Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.2311798Z unimplemented [] 2025-12-04T09:14:58.2311911Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.2312098Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.2313278Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.2313343Z graph_break [] 2025-12-04T09:14:58.2313471Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.2313543Z Autotune Choices Stats: 2025-12-04T09:14:58.2314968Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.2315220Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2315438Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2315761Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2316888Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2318006Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2319134Z triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2320330Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.2321494Z triton_flex_attention_345 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.2322649Z triton_flex_attention_343 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2322892Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2820 seconds precompiling for 6 choices 2025-12-04T09:14:58.2322965Z Autotune Choices Stats: 2025-12-04T09:14:58.2324407Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_349", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.2324846Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2325171Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2325738Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2326907Z triton_flex_attention_backward_349 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2328063Z triton_flex_attention_backward_348 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2329309Z triton_flex_attention_backward_350 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2330458Z triton_flex_attention_backward_351 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2331642Z triton_flex_attention_backward_353 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2332856Z triton_flex_attention_backward_355 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2334018Z triton_flex_attention_backward_352 0.0143 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2335175Z triton_flex_attention_backward_354 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2336325Z triton_flex_attention_backward_357 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2337554Z triton_flex_attention_backward_360 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.2337834Z SingleProcess AUTOTUNE benchmarking takes 0.6637 seconds and 2.2959 seconds precompiling for 13 choices 2025-12-04T09:14:58.2337968Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.2338037Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.2338099Z unimplemented [] 2025-12-04T09:14:58.2338207Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.2338395Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.2339618Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.2339682Z graph_break [] 2025-12-04T09:14:58.2339809Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.2339884Z Autotune Choices Stats: 2025-12-04T09:14:58.2341275Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.2341531Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2341746Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2342066Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2343196Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2344324Z triton_flex_attention_366 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2345524Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.2346702Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2347846Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.2348964Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2349209Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3270 seconds precompiling for 6 choices 2025-12-04T09:14:58.2349294Z Autotune Choices Stats: 2025-12-04T09:14:58.2350717Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_368", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.2351158Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2351486Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2352053Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2353219Z triton_flex_attention_backward_368 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2354458Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2355642Z triton_flex_attention_backward_369 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2356821Z triton_flex_attention_backward_370 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2357984Z triton_flex_attention_backward_372 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2359145Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2360298Z triton_flex_attention_backward_374 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2361451Z triton_flex_attention_backward_371 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2362914Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2364210Z triton_flex_attention_backward_379 0.0146 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.2364468Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.4637 seconds precompiling for 13 choices 2025-12-04T09:14:58.2364613Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.2364686Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.2364756Z unimplemented [] 2025-12-04T09:14:58.2364874Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.2365132Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.2366317Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.2366391Z graph_break [] 2025-12-04T09:14:58.2366533Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.2366611Z Autotune Choices Stats: 2025-12-04T09:14:58.2368014Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_384", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.2368271Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2368495Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2368822Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2369959Z triton_flex_attention_384 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2371102Z triton_flex_attention_385 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2372420Z triton_flex_attention_380 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2373562Z triton_flex_attention_382 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.2374709Z triton_flex_attention_383 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.2375832Z triton_flex_attention_381 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2376088Z SingleProcess AUTOTUNE benchmarking takes 0.2918 seconds and 1.3242 seconds precompiling for 6 choices 2025-12-04T09:14:58.2376163Z Autotune Choices Stats: 2025-12-04T09:14:58.2377596Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_386", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.2378042Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2378374Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2378943Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2380197Z triton_flex_attention_backward_386 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2381418Z triton_flex_attention_backward_387 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2382615Z triton_flex_attention_backward_388 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2383781Z triton_flex_attention_backward_389 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2384944Z triton_flex_attention_backward_391 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2386099Z triton_flex_attention_backward_392 0.0134 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2387252Z triton_flex_attention_backward_390 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2388398Z triton_flex_attention_backward_393 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2389674Z triton_flex_attention_backward_395 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2390830Z triton_flex_attention_backward_394 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2391111Z SingleProcess AUTOTUNE benchmarking takes 0.6645 seconds and 2.2524 seconds precompiling for 13 choices 2025-12-04T09:14:58.2391250Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.2391322Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.2391387Z unimplemented [] 2025-12-04T09:14:58.2391506Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.2391701Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.2392890Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.2392959Z graph_break [] 2025-12-04T09:14:58.2393091Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.2393163Z Autotune Choices Stats: 2025-12-04T09:14:58.2394568Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_403", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.2394822Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2395046Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2395370Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2396500Z triton_flex_attention_403 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2397688Z triton_flex_attention_404 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2398838Z triton_flex_attention_399 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2399994Z triton_flex_attention_401 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.2401107Z triton_flex_attention_402 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.2402228Z triton_flex_attention_400 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2402470Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3374 seconds precompiling for 6 choices 2025-12-04T09:14:58.2402549Z Autotune Choices Stats: 2025-12-04T09:14:58.2403979Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_406", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.2404422Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2404750Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2405382Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2406576Z triton_flex_attention_backward_406 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2407733Z triton_flex_attention_backward_405 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2408915Z triton_flex_attention_backward_407 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2410078Z triton_flex_attention_backward_408 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2411233Z triton_flex_attention_backward_410 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2412448Z triton_flex_attention_backward_409 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2413622Z triton_flex_attention_backward_411 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2414841Z triton_flex_attention_backward_412 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2416027Z triton_flex_attention_backward_413 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2417226Z triton_flex_attention_backward_414 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2417475Z SingleProcess AUTOTUNE benchmarking takes 0.6644 seconds and 2.4031 seconds precompiling for 13 choices 2025-12-04T09:14:58.2417606Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.2417682Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.2417746Z unimplemented [] 2025-12-04T09:14:58.2417855Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.2418039Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.2419225Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.2419293Z graph_break [] 2025-12-04T09:14:58.2419420Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.2419490Z Autotune Choices Stats: 2025-12-04T09:14:58.2420883Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_422", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.2421137Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2421361Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2421686Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2422903Z triton_flex_attention_422 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2424047Z triton_flex_attention_423 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2425195Z triton_flex_attention_420 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.2426319Z triton_flex_attention_421 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.2427429Z triton_flex_attention_418 0.0123 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2428548Z triton_flex_attention_419 0.0142 ms 57.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2428800Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3073 seconds precompiling for 6 choices 2025-12-04T09:14:58.2428887Z Autotune Choices Stats: 2025-12-04T09:14:58.2430311Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_426", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012256000190973282, "best_triton_pos": 0} 2025-12-04T09:14:58.2430786Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2431152Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2431752Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2432913Z triton_flex_attention_backward_426 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2434102Z triton_flex_attention_backward_424 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2435252Z triton_flex_attention_backward_425 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2436420Z triton_flex_attention_backward_427 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2437580Z triton_flex_attention_backward_429 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2438752Z triton_flex_attention_backward_431 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2439938Z triton_flex_attention_backward_430 0.0141 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2441153Z triton_flex_attention_backward_428 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2442343Z triton_flex_attention_backward_433 0.0143 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2443491Z triton_flex_attention_backward_436 0.0144 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.2443748Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.3392 seconds precompiling for 13 choices 2025-12-04T09:14:58.2443883Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.2443964Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.2444030Z unimplemented [] 2025-12-04T09:14:58.2444144Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.2444330Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.2445500Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.2445571Z graph_break [] 2025-12-04T09:14:58.2445700Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.2445775Z Autotune Choices Stats: 2025-12-04T09:14:58.2447172Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_441", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.2447461Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2447681Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2448086Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2449218Z triton_flex_attention_441 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2450363Z triton_flex_attention_442 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2451476Z triton_flex_attention_439 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.2452653Z triton_flex_attention_437 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2453769Z triton_flex_attention_440 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.2454888Z triton_flex_attention_438 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2455138Z SingleProcess AUTOTUNE benchmarking takes 0.2906 seconds and 1.4377 seconds precompiling for 6 choices 2025-12-04T09:14:58.2455211Z Autotune Choices Stats: 2025-12-04T09:14:58.2456672Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_445", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.2457176Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2457504Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2458074Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2459280Z triton_flex_attention_backward_445 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2460438Z triton_flex_attention_backward_446 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2461582Z triton_flex_attention_backward_443 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2462958Z triton_flex_attention_backward_444 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2464129Z triton_flex_attention_backward_447 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2465285Z triton_flex_attention_backward_448 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2466605Z triton_flex_attention_backward_450 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2467755Z triton_flex_attention_backward_449 0.0135 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2468960Z triton_flex_attention_backward_452 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2470115Z triton_flex_attention_backward_455 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.2470363Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.3278 seconds precompiling for 13 choices 2025-12-04T09:14:58.2470495Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.2470573Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.2470635Z unimplemented [] 2025-12-04T09:14:58.2470761Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.2470958Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.2472151Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.2472221Z graph_break [] 2025-12-04T09:14:58.2472352Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.2472428Z Autotune Choices Stats: 2025-12-04T09:14:58.2473873Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_460", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:58.2474195Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2474417Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2474739Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2475882Z triton_flex_attention_460 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2477081Z triton_flex_attention_461 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2478209Z triton_flex_attention_459 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.2479337Z triton_flex_attention_456 0.0114 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2480448Z triton_flex_attention_458 0.0123 ms 75.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.2481574Z triton_flex_attention_457 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2481828Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.4513 seconds precompiling for 6 choices 2025-12-04T09:14:58.2481937Z Autotune Choices Stats: 2025-12-04T09:14:58.2483430Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_463", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.2483904Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2484235Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2484842Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2486006Z triton_flex_attention_backward_463 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2487174Z triton_flex_attention_backward_465 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2488324Z triton_flex_attention_backward_462 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2489489Z triton_flex_attention_backward_464 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2490637Z triton_flex_attention_backward_467 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2491847Z triton_flex_attention_backward_468 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2493109Z triton_flex_attention_backward_469 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2494298Z triton_flex_attention_backward_466 0.0133 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2495460Z triton_flex_attention_backward_471 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2496619Z triton_flex_attention_backward_470 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2496864Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.2662 seconds precompiling for 13 choices 2025-12-04T09:14:58.2496992Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.2497070Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.2497134Z unimplemented [] 2025-12-04T09:14:58.2497238Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.2497433Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.2498611Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.2498680Z graph_break [] 2025-12-04T09:14:58.2498818Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.2499230Z Autotune Choices Stats: 2025-12-04T09:14:58.2500661Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_479", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.007968000136315823, "best_triton_pos": 0} 2025-12-04T09:14:58.2500949Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2501176Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2501492Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2502666Z triton_flex_attention_479 0.0080 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2503775Z triton_flex_attention_480 0.0082 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2504900Z triton_flex_attention_477 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.2506017Z triton_flex_attention_478 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.2507130Z triton_flex_attention_475 0.0113 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2508246Z triton_flex_attention_476 0.0133 ms 59.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2508590Z SingleProcess AUTOTUNE benchmarking takes 0.2917 seconds and 1.3266 seconds precompiling for 6 choices 2025-12-04T09:14:58.2508665Z Autotune Choices Stats: 2025-12-04T09:14:58.2510099Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_481", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.2510540Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2510897Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2511475Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2512643Z triton_flex_attention_backward_481 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2513811Z triton_flex_attention_backward_482 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2514965Z triton_flex_attention_backward_483 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2516115Z triton_flex_attention_backward_484 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2517354Z triton_flex_attention_backward_485 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2518535Z triton_flex_attention_backward_486 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2519733Z triton_flex_attention_backward_487 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2520882Z triton_flex_attention_backward_488 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2522038Z triton_flex_attention_backward_490 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2523189Z triton_flex_attention_backward_493 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.2523444Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2475 seconds precompiling for 13 choices 2025-12-04T09:14:58.2523620Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:14:58.2523706Z Traceback (most recent call last): 2025-12-04T09:14:58.2524004Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:14:58.2524073Z self.assertTrue( 2025-12-04T09:14:58.2524294Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:14:58.2524383Z raise self.failureException(msg) 2025-12-04T09:14:58.2524636Z AssertionError: False is not true : Log file /tmp/tmpne01nos0/flex_attention_configs.json was not created 2025-12-04T09:14:58.2524675Z 2025-12-04T09:14:58.2524815Z To execute this test, run the following from the base repo dir: 2025-12-04T09:14:58.2525068Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:14:58.2525106Z 2025-12-04T09:14:58.2525311Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:14:58.2525446Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.2525530Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.2525593Z unimplemented [] 2025-12-04T09:14:58.2525704Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.2526905Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('async_compile_cache_miss', 23), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:14:58.2527098Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.2527169Z graph_break [] 2025-12-04T09:14:58.2527332Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.2528331Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:14:58.2528417Z current_size = base.storage().size() 2025-12-04T09:14:58.2528489Z Autotune Choices Stats: 2025-12-04T09:14:58.2529907Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:58.2530156Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2530386Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2530706Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2531844Z triton_flex_attention_4 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2533023Z triton_flex_attention_5 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2534218Z triton_flex_attention_0 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2535367Z triton_flex_attention_2 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.2536516Z triton_flex_attention_3 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.2537632Z triton_flex_attention_1 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2537885Z SingleProcess AUTOTUNE benchmarking takes 0.2600 seconds and 1.4244 seconds precompiling for 6 choices 2025-12-04T09:14:58.2537957Z Autotune Choices Stats: 2025-12-04T09:14:58.2539386Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:58.2539827Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2540167Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2540731Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2541904Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2543154Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2544311Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2545512Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2546658Z triton_flex_attention_backward_11 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2547810Z triton_flex_attention_backward_12 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2548966Z triton_flex_attention_backward_13 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2550115Z triton_flex_attention_backward_10 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2551339Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2552510Z triton_flex_attention_backward_15 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2552781Z SingleProcess AUTOTUNE benchmarking takes 0.8717 seconds and 2.2315 seconds precompiling for 13 choices 2025-12-04T09:14:58.2552922Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.2552996Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.2553069Z unimplemented [] 2025-12-04T09:14:58.2553214Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.2553409Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.2554586Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.2554650Z graph_break [] 2025-12-04T09:14:58.2554784Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.2554854Z Autotune Choices Stats: 2025-12-04T09:14:58.2556244Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.2556491Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2556724Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2557045Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2558181Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2559327Z triton_flex_attention_24 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2560503Z triton_flex_attention_19 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2561644Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.2563006Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.2564132Z triton_flex_attention_20 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2564387Z SingleProcess AUTOTUNE benchmarking takes 0.2898 seconds and 1.2778 seconds precompiling for 6 choices 2025-12-04T09:14:58.2564455Z Autotune Choices Stats: 2025-12-04T09:14:58.2565890Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.2566330Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2566663Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2567227Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2568524Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2569723Z triton_flex_attention_backward_27 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2570925Z triton_flex_attention_backward_28 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2572078Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2573281Z triton_flex_attention_backward_30 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2574436Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2575587Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2576775Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2578003Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2579180Z triton_flex_attention_backward_37 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.2579439Z SingleProcess AUTOTUNE benchmarking takes 0.6605 seconds and 2.3199 seconds precompiling for 13 choices 2025-12-04T09:14:58.2579571Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.2579641Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.2579712Z unimplemented [] 2025-12-04T09:14:58.2579817Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.2580010Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.2581191Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.2581257Z graph_break [] 2025-12-04T09:14:58.2581390Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.2581458Z Autotune Choices Stats: 2025-12-04T09:14:58.2582853Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.2583101Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2583341Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2583657Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2584791Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2586001Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2587121Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.2588262Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.2589375Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2590486Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2590739Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2616 seconds precompiling for 6 choices 2025-12-04T09:14:58.2590820Z Autotune Choices Stats: 2025-12-04T09:14:58.2592254Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.2592690Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2593024Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2593686Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2594852Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2596032Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2597191Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2598343Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2599486Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2600637Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2601787Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2603000Z triton_flex_attention_backward_50 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2604181Z triton_flex_attention_backward_53 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2605362Z triton_flex_attention_backward_56 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.2605616Z SingleProcess AUTOTUNE benchmarking takes 0.6631 seconds and 2.2635 seconds precompiling for 13 choices 2025-12-04T09:14:58.2605749Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.2605821Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.2605892Z unimplemented [] 2025-12-04T09:14:58.2606000Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.2606191Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.2607374Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.2607437Z graph_break [] 2025-12-04T09:14:58.2607574Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.2607646Z Autotune Choices Stats: 2025-12-04T09:14:58.2609037Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.2609282Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2609512Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2609863Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2611039Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2612234Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2613448Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.2614566Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.2615686Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2616792Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2617047Z SingleProcess AUTOTUNE benchmarking takes 0.2907 seconds and 1.2500 seconds precompiling for 6 choices 2025-12-04T09:14:58.2617114Z Autotune Choices Stats: 2025-12-04T09:14:58.2618541Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.2619006Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2619404Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2619964Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2621131Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2622324Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2623482Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2624639Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2625787Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2626936Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2628148Z triton_flex_attention_backward_69 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2629328Z triton_flex_attention_backward_70 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2630504Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2631653Z triton_flex_attention_backward_75 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.2631908Z SingleProcess AUTOTUNE benchmarking takes 0.6601 seconds and 2.1469 seconds precompiling for 13 choices 2025-12-04T09:14:58.2632039Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.2632108Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.2632177Z unimplemented [] 2025-12-04T09:14:58.2632278Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.2632462Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.2633642Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.2633706Z graph_break [] 2025-12-04T09:14:58.2633860Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.2633927Z Autotune Choices Stats: 2025-12-04T09:14:58.2635317Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.2635597Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2635852Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2636199Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2637327Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2638484Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2639612Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.2640717Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.2641823Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2642934Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2643185Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2576 seconds precompiling for 6 choices 2025-12-04T09:14:58.2643252Z Autotune Choices Stats: 2025-12-04T09:14:58.2644723Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_84", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.2645217Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2645546Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2646119Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2647310Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2648464Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2649626Z triton_flex_attention_backward_83 0.0122 ms 92.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2650775Z triton_flex_attention_backward_82 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2651931Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2653160Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2654371Z triton_flex_attention_backward_88 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2655556Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2656700Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2657849Z triton_flex_attention_backward_91 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2658101Z SingleProcess AUTOTUNE benchmarking takes 0.6599 seconds and 2.2799 seconds precompiling for 13 choices 2025-12-04T09:14:58.2658229Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.2658298Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.2658369Z unimplemented [] 2025-12-04T09:14:58.2658472Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.2658656Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.2659855Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.2659921Z graph_break [] 2025-12-04T09:14:58.2660053Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.2660119Z Autotune Choices Stats: 2025-12-04T09:14:58.2661537Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:58.2662038Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2662269Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2662592Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2663726Z triton_flex_attention_99 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2664909Z triton_flex_attention_100 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2666047Z triton_flex_attention_95 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2667163Z triton_flex_attention_98 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.2668282Z triton_flex_attention_97 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.2669391Z triton_flex_attention_96 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2669693Z SingleProcess AUTOTUNE benchmarking takes 0.2905 seconds and 1.2994 seconds precompiling for 6 choices 2025-12-04T09:14:58.2669758Z Autotune Choices Stats: 2025-12-04T09:14:58.2671237Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_102", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.2671718Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2672053Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2672665Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2673841Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2675001Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2676168Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2677333Z triton_flex_attention_backward_101 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2678489Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2679747Z triton_flex_attention_backward_107 0.0133 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2680898Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2682090Z triton_flex_attention_backward_108 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2683240Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2684394Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2684657Z SingleProcess AUTOTUNE benchmarking takes 0.6615 seconds and 2.2434 seconds precompiling for 13 choices 2025-12-04T09:14:58.2684794Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.2684866Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.2684933Z unimplemented [] 2025-12-04T09:14:58.2685038Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.2685226Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.2686408Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.2686469Z graph_break [] 2025-12-04T09:14:58.2686639Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.2686706Z Autotune Choices Stats: 2025-12-04T09:14:58.2688139Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.2688417Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2688634Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2688966Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2690130Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2691245Z triton_flex_attention_119 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2692431Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.2693558Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.2694682Z triton_flex_attention_114 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2695833Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2696150Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2587 seconds precompiling for 6 choices 2025-12-04T09:14:58.2696217Z Autotune Choices Stats: 2025-12-04T09:14:58.2697651Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.2698086Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2698451Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2699009Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2700183Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2701350Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2702513Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2703665Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2704888Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2706089Z triton_flex_attention_backward_124 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2707271Z triton_flex_attention_backward_126 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2708429Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2709580Z triton_flex_attention_backward_129 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2710736Z triton_flex_attention_backward_132 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.2710985Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.3330 seconds precompiling for 13 choices 2025-12-04T09:14:58.2711122Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.2711191Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.2711258Z unimplemented [] 2025-12-04T09:14:58.2711363Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.2711549Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.2712760Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.2712885Z graph_break [] 2025-12-04T09:14:58.2713035Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.2713106Z Autotune Choices Stats: 2025-12-04T09:14:58.2714503Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:58.2714748Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2715002Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2715323Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2716448Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2717568Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2718686Z triton_flex_attention_136 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.2719799Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.2720913Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2722121Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2722393Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3488 seconds precompiling for 6 choices 2025-12-04T09:14:58.2722461Z Autotune Choices Stats: 2025-12-04T09:14:58.2723927Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_140", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.2724369Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2724704Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2725274Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2726457Z triton_flex_attention_backward_140 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2727612Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2728780Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2729959Z triton_flex_attention_backward_142 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2731170Z triton_flex_attention_backward_144 0.0131 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2732420Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2733607Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2734768Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2735919Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2737078Z triton_flex_attention_backward_149 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2737328Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.2392 seconds precompiling for 13 choices 2025-12-04T09:14:58.2737465Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.2737537Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.2737610Z unimplemented [] 2025-12-04T09:14:58.2737749Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.2737934Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.2739172Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.2739270Z graph_break [] 2025-12-04T09:14:58.2739406Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.2739475Z Autotune Choices Stats: 2025-12-04T09:14:58.2740895Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.2741151Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2741369Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2741690Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2742831Z triton_flex_attention_156 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2743945Z triton_flex_attention_157 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2745063Z triton_flex_attention_152 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2746178Z triton_flex_attention_154 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.2747415Z triton_flex_attention_155 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.2748564Z triton_flex_attention_153 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2748824Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3309 seconds precompiling for 6 choices 2025-12-04T09:14:58.2748892Z Autotune Choices Stats: 2025-12-04T09:14:58.2750367Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_160", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.2750805Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2751145Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2751706Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2752872Z triton_flex_attention_backward_160 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2754026Z triton_flex_attention_backward_158 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2755178Z triton_flex_attention_backward_159 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2756423Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2757573Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2758758Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2759908Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2761065Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2762421Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2763621Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.2763949Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.1546 seconds precompiling for 13 choices 2025-12-04T09:14:58.2764089Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.2764260Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.2764334Z unimplemented [] 2025-12-04T09:14:58.2764442Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.2764627Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.2765815Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.2765880Z graph_break [] 2025-12-04T09:14:58.2766015Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.2766086Z Autotune Choices Stats: 2025-12-04T09:14:58.2767534Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.2767789Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2768013Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2768341Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2769503Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2770637Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2771759Z triton_flex_attention_171 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2773007Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.2774207Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.2775370Z triton_flex_attention_172 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2775629Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3199 seconds precompiling for 6 choices 2025-12-04T09:14:58.2775696Z Autotune Choices Stats: 2025-12-04T09:14:58.2777130Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_178", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.2777572Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2777903Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2778464Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2779648Z triton_flex_attention_backward_178 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2780801Z triton_flex_attention_backward_177 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2782025Z triton_flex_attention_backward_180 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2783215Z triton_flex_attention_backward_179 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2784400Z triton_flex_attention_backward_182 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2785554Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2786715Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2787872Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2789030Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2790214Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.2790520Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3449 seconds precompiling for 13 choices 2025-12-04T09:14:58.2790675Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.2790746Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.2790811Z unimplemented [] 2025-12-04T09:14:58.2790920Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.2791109Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.2792330Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.2792395Z graph_break [] 2025-12-04T09:14:58.2792529Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.2792598Z Autotune Choices Stats: 2025-12-04T09:14:58.2793991Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:58.2794245Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2794468Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2794792Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2795924Z triton_flex_attention_194 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2797050Z triton_flex_attention_195 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2798170Z triton_flex_attention_192 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.2799397Z triton_flex_attention_193 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.2800519Z triton_flex_attention_190 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2801678Z triton_flex_attention_191 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2801929Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3094 seconds precompiling for 6 choices 2025-12-04T09:14:58.2801997Z Autotune Choices Stats: 2025-12-04T09:14:58.2803436Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.2803870Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2804201Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2804769Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2805937Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2807174Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2808371Z triton_flex_attention_backward_197 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2809551Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2810708Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2811859Z triton_flex_attention_backward_200 0.0124 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2813080Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2814238Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2815395Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.2816655Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2816901Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.3930 seconds precompiling for 13 choices 2025-12-04T09:14:58.2817032Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.2817101Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.2817165Z unimplemented [] 2025-12-04T09:14:58.2817275Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.2817463Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.2818675Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.2818747Z graph_break [] 2025-12-04T09:14:58.2818880Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.2818949Z Autotune Choices Stats: 2025-12-04T09:14:58.2820355Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008063999935984612, "best_triton_pos": 0} 2025-12-04T09:14:58.2820612Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2820833Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2821159Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2822290Z triton_flex_attention_213 0.0081 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2823408Z triton_flex_attention_214 0.0082 ms 98.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2824594Z triton_flex_attention_211 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.2825747Z triton_flex_attention_212 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.2826903Z triton_flex_attention_209 0.0112 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2828020Z triton_flex_attention_210 0.0133 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2828276Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.3148 seconds precompiling for 6 choices 2025-12-04T09:14:58.2828343Z Autotune Choices Stats: 2025-12-04T09:14:58.2829781Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.2830226Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2830565Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2831125Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2832324Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2833571Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2834765Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2835919Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2837092Z triton_flex_attention_backward_219 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2838245Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2839405Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2840564Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2841777Z triton_flex_attention_backward_224 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2842963Z triton_flex_attention_backward_227 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.2843209Z SingleProcess AUTOTUNE benchmarking takes 0.6619 seconds and 2.3746 seconds precompiling for 13 choices 2025-12-04T09:14:58.2843351Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.2843458Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.2843521Z unimplemented [] 2025-12-04T09:14:58.2843637Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.2843822Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.2845003Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.2845067Z graph_break [] 2025-12-04T09:14:58.2845200Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.2845277Z Autotune Choices Stats: 2025-12-04T09:14:58.2846671Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.2846921Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2847143Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2847466Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2848596Z triton_flex_attention_232 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2849790Z triton_flex_attention_233 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2850936Z triton_flex_attention_228 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2852094Z triton_flex_attention_230 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.2853313Z triton_flex_attention_231 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.2854437Z triton_flex_attention_229 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2854691Z SingleProcess AUTOTUNE benchmarking takes 0.2900 seconds and 1.2906 seconds precompiling for 6 choices 2025-12-04T09:14:58.2854759Z Autotune Choices Stats: 2025-12-04T09:14:58.2856193Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.2856637Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2856964Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2857575Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2858815Z triton_flex_attention_backward_234 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2859976Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2861169Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2862541Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2863730Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2864895Z triton_flex_attention_backward_240 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2866064Z triton_flex_attention_backward_241 0.0134 ms 91.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2867291Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2868554Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2869764Z triton_flex_attention_backward_246 0.0153 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.2870010Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2768 seconds precompiling for 13 choices 2025-12-04T09:14:58.2870145Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.2870216Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.2870279Z unimplemented [] 2025-12-04T09:14:58.2870389Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.2870578Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.2871761Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.2871824Z graph_break [] 2025-12-04T09:14:58.2871952Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.2872027Z Autotune Choices Stats: 2025-12-04T09:14:58.2873415Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.2873665Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2873883Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2874222Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2875390Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2876580Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2877733Z triton_flex_attention_250 0.0111 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.2878854Z triton_flex_attention_247 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2879977Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.2881093Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2881346Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2952 seconds precompiling for 6 choices 2025-12-04T09:14:58.2881415Z Autotune Choices Stats: 2025-12-04T09:14:58.2882844Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_255", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.2883284Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2883749Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2884350Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2885515Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2886712Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2887872Z triton_flex_attention_backward_254 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2889037Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2890188Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2891338Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2892566Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2893806Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2894958Z triton_flex_attention_backward_262 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2896148Z triton_flex_attention_backward_265 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.2896396Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.3139 seconds precompiling for 13 choices 2025-12-04T09:14:58.2896535Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.2896605Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.2896671Z unimplemented [] 2025-12-04T09:14:58.2896784Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.2896972Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.2898151Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.2898214Z graph_break [] 2025-12-04T09:14:58.2898339Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.2898418Z Autotune Choices Stats: 2025-12-04T09:14:58.2899819Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.2900074Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2900345Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2900669Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2901869Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2902994Z triton_flex_attention_271 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2904147Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2905277Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.2906406Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.2907528Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2907783Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T09:14:58.2907852Z Autotune Choices Stats: 2025-12-04T09:14:58.2909281Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.2909816Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2910141Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2910708Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2911925Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2913091Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2914261Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2915415Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2916589Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2917753Z triton_flex_attention_backward_278 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2918986Z triton_flex_attention_backward_279 0.0136 ms 90.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2920185Z triton_flex_attention_backward_276 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2921369Z triton_flex_attention_backward_281 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2922523Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.2922773Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.2940 seconds precompiling for 13 choices 2025-12-04T09:14:58.2922905Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.2922974Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.2923036Z unimplemented [] 2025-12-04T09:14:58.2923142Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.2923328Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.2924514Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.2924578Z graph_break [] 2025-12-04T09:14:58.2924705Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.2924779Z Autotune Choices Stats: 2025-12-04T09:14:58.2926163Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.2926476Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2926730Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2927054Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2928180Z triton_flex_attention_289 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2929354Z triton_flex_attention_290 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2930475Z triton_flex_attention_287 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.2931598Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2932751Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.2933876Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2934122Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3003 seconds precompiling for 6 choices 2025-12-04T09:14:58.2934189Z Autotune Choices Stats: 2025-12-04T09:14:58.2935661Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_294", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.012160000391304493, "best_triton_pos": 0} 2025-12-04T09:14:58.2936167Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2936490Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2937058Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2938252Z triton_flex_attention_backward_294 0.0122 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2939415Z triton_flex_attention_backward_291 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2940574Z triton_flex_attention_backward_292 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2941730Z triton_flex_attention_backward_293 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2942882Z triton_flex_attention_backward_296 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2944261Z triton_flex_attention_backward_295 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2945452Z triton_flex_attention_backward_297 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2946636Z triton_flex_attention_backward_298 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2947791Z triton_flex_attention_backward_300 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2948957Z triton_flex_attention_backward_303 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.2949202Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.2523 seconds precompiling for 13 choices 2025-12-04T09:14:58.2949348Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.2949422Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.2949485Z unimplemented [] 2025-12-04T09:14:58.2949596Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.2949781Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.2950969Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.2951031Z graph_break [] 2025-12-04T09:14:58.2951160Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.2951236Z Autotune Choices Stats: 2025-12-04T09:14:58.2952673Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_308", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.2952985Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2953206Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2953531Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2954698Z triton_flex_attention_308 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2955823Z triton_flex_attention_309 0.0082 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2956951Z triton_flex_attention_307 0.0112 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.2958076Z triton_flex_attention_304 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2959201Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.2960324Z triton_flex_attention_305 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2960603Z SingleProcess AUTOTUNE benchmarking takes 0.2924 seconds and 1.4096 seconds precompiling for 6 choices 2025-12-04T09:14:58.2960712Z Autotune Choices Stats: 2025-12-04T09:14:58.2962480Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.2962943Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2963324Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2963906Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2965090Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2966266Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2967426Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2968603Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2969805Z triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2971068Z triton_flex_attention_backward_316 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2972266Z triton_flex_attention_backward_317 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2973512Z triton_flex_attention_backward_314 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2974671Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2975831Z triton_flex_attention_backward_318 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2976082Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.3225 seconds precompiling for 13 choices 2025-12-04T09:14:58.2976222Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.2976296Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.2976359Z unimplemented [] 2025-12-04T09:14:58.2976476Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.2976668Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.2977866Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.2978011Z graph_break [] 2025-12-04T09:14:58.2978142Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.2978287Z Autotune Choices Stats: 2025-12-04T09:14:58.2979780Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.007968000136315823, "best_triton_pos": 0} 2025-12-04T09:14:58.2980038Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2980291Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2980687Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2981819Z triton_flex_attention_327 0.0080 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2982936Z triton_flex_attention_328 0.0082 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2984081Z triton_flex_attention_325 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.2985211Z triton_flex_attention_326 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.2986327Z triton_flex_attention_323 0.0113 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2987527Z triton_flex_attention_324 0.0133 ms 59.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2987809Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T09:14:58.2987885Z Autotune Choices Stats: 2025-12-04T09:14:58.2989324Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.2989805Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.2990134Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.2990705Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.2991876Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.2993035Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2994194Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.2995346Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2996629Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.2997778Z triton_flex_attention_backward_336 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.2999001Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3000149Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.3001303Z triton_flex_attention_backward_338 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3002453Z triton_flex_attention_backward_341 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.3002701Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3668 seconds precompiling for 13 choices 2025-12-04T09:14:58.3002836Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.3002906Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.3002969Z unimplemented [] 2025-12-04T09:14:58.3003086Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.3003275Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.3004529Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.3004628Z graph_break [] 2025-12-04T09:14:58.3004754Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.3004828Z Autotune Choices Stats: 2025-12-04T09:14:58.3006216Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.3006500Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.3006719Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.3007038Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.3008165Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3009285Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3010398Z triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3011516Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.3012755Z triton_flex_attention_345 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.3013950Z triton_flex_attention_343 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3014195Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2820 seconds precompiling for 6 choices 2025-12-04T09:14:58.3014274Z Autotune Choices Stats: 2025-12-04T09:14:58.3015773Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_349", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.3016217Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.3016541Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.3017130Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.3018298Z triton_flex_attention_backward_349 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3019459Z triton_flex_attention_backward_348 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3020605Z triton_flex_attention_backward_350 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.3021826Z triton_flex_attention_backward_351 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.3023012Z triton_flex_attention_backward_353 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3024192Z triton_flex_attention_backward_355 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.3025351Z triton_flex_attention_backward_352 0.0143 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3026499Z triton_flex_attention_backward_354 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.3027651Z triton_flex_attention_backward_357 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3028816Z triton_flex_attention_backward_360 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.3029062Z SingleProcess AUTOTUNE benchmarking takes 0.6637 seconds and 2.2959 seconds precompiling for 13 choices 2025-12-04T09:14:58.3029251Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.3029321Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.3029385Z unimplemented [] 2025-12-04T09:14:58.3029495Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.3029747Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.3030926Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.3031007Z graph_break [] 2025-12-04T09:14:58.3031137Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.3031214Z Autotune Choices Stats: 2025-12-04T09:14:58.3032635Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.3032892Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.3033115Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.3033435Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.3034575Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3035703Z triton_flex_attention_366 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3036826Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.3037945Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3039165Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.3040287Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3040536Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3270 seconds precompiling for 6 choices 2025-12-04T09:14:58.3040645Z Autotune Choices Stats: 2025-12-04T09:14:58.3042068Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_368", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.3042513Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.3042842Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.3043410Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.3044572Z triton_flex_attention_backward_368 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3045733Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3046932Z triton_flex_attention_backward_369 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.3048154Z triton_flex_attention_backward_370 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.3049344Z triton_flex_attention_backward_372 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3050493Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.3051654Z triton_flex_attention_backward_374 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.3052887Z triton_flex_attention_backward_371 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3054047Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3055200Z triton_flex_attention_backward_379 0.0146 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.3055550Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.4637 seconds precompiling for 13 choices 2025-12-04T09:14:58.3055685Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.3055764Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.3055829Z unimplemented [] 2025-12-04T09:14:58.3055938Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.3056121Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.3057293Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.3057364Z graph_break [] 2025-12-04T09:14:58.3057492Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.3057613Z Autotune Choices Stats: 2025-12-04T09:14:58.3059000Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_384", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.3059251Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.3059481Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.3059802Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.3060929Z triton_flex_attention_384 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3062253Z triton_flex_attention_385 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3063386Z triton_flex_attention_380 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3064623Z triton_flex_attention_382 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.3065781Z triton_flex_attention_383 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.3066955Z triton_flex_attention_381 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3067204Z SingleProcess AUTOTUNE benchmarking takes 0.2918 seconds and 1.3242 seconds precompiling for 6 choices 2025-12-04T09:14:58.3067277Z Autotune Choices Stats: 2025-12-04T09:14:58.3068707Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_386", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.3069146Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.3069472Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.3070036Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.3071213Z triton_flex_attention_backward_386 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3072411Z triton_flex_attention_backward_387 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3073628Z triton_flex_attention_backward_388 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.3074788Z triton_flex_attention_backward_389 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.3075981Z triton_flex_attention_backward_391 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3077132Z triton_flex_attention_backward_392 0.0134 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.3078288Z triton_flex_attention_backward_390 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3079431Z triton_flex_attention_backward_393 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.3080591Z triton_flex_attention_backward_395 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3081810Z triton_flex_attention_backward_394 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3082084Z SingleProcess AUTOTUNE benchmarking takes 0.6645 seconds and 2.2524 seconds precompiling for 13 choices 2025-12-04T09:14:58.3082214Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.3082291Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.3082353Z unimplemented [] 2025-12-04T09:14:58.3082460Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.3082645Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.3083854Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.3083925Z graph_break [] 2025-12-04T09:14:58.3084053Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.3084130Z Autotune Choices Stats: 2025-12-04T09:14:58.3085518Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_403", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.3085768Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.3085988Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.3086304Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.3087450Z triton_flex_attention_403 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3088572Z triton_flex_attention_404 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3089750Z triton_flex_attention_399 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3090921Z triton_flex_attention_401 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.3092114Z triton_flex_attention_402 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.3093309Z triton_flex_attention_400 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3093558Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3374 seconds precompiling for 6 choices 2025-12-04T09:14:58.3093634Z Autotune Choices Stats: 2025-12-04T09:14:58.3095060Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_406", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.3095503Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.3095833Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.3096403Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.3097567Z triton_flex_attention_backward_406 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3098795Z triton_flex_attention_backward_405 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3100010Z triton_flex_attention_backward_407 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.3101235Z triton_flex_attention_backward_408 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.3102388Z triton_flex_attention_backward_410 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3103544Z triton_flex_attention_backward_409 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3104702Z triton_flex_attention_backward_411 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.3105856Z triton_flex_attention_backward_412 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.3107042Z triton_flex_attention_backward_413 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3108262Z triton_flex_attention_backward_414 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3108511Z SingleProcess AUTOTUNE benchmarking takes 0.6644 seconds and 2.4031 seconds precompiling for 13 choices 2025-12-04T09:14:58.3108641Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.3108722Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.3108788Z unimplemented [] 2025-12-04T09:14:58.3108899Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.3109118Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.3110291Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.3110361Z graph_break [] 2025-12-04T09:14:58.3110487Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.3110563Z Autotune Choices Stats: 2025-12-04T09:14:58.3111961Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_422", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.3112211Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.3112440Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.3112759Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.3113896Z triton_flex_attention_422 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3115034Z triton_flex_attention_423 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3116219Z triton_flex_attention_420 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.3117335Z triton_flex_attention_421 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.3118485Z triton_flex_attention_418 0.0123 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3119601Z triton_flex_attention_419 0.0142 ms 57.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3119850Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3073 seconds precompiling for 6 choices 2025-12-04T09:14:58.3119924Z Autotune Choices Stats: 2025-12-04T09:14:58.3121342Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_426", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012256000190973282, "best_triton_pos": 0} 2025-12-04T09:14:58.3121804Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.3122132Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.3122707Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.3123955Z triton_flex_attention_backward_426 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.3125143Z triton_flex_attention_backward_424 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3126328Z triton_flex_attention_backward_425 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3127484Z triton_flex_attention_backward_427 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.3128703Z triton_flex_attention_backward_429 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3129858Z triton_flex_attention_backward_431 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.3131011Z triton_flex_attention_backward_430 0.0141 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.3132158Z triton_flex_attention_backward_428 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3133471Z triton_flex_attention_backward_433 0.0143 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3134627Z triton_flex_attention_backward_436 0.0144 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.3134906Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.3392 seconds precompiling for 13 choices 2025-12-04T09:14:58.3135039Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.3135113Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.3135177Z unimplemented [] 2025-12-04T09:14:58.3135281Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.3135470Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.3136805Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.3136879Z graph_break [] 2025-12-04T09:14:58.3137007Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.3137085Z Autotune Choices Stats: 2025-12-04T09:14:58.3138463Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_441", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.3138714Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.3138945Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.3139261Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.3140385Z triton_flex_attention_441 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3141597Z triton_flex_attention_442 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3142715Z triton_flex_attention_439 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.3143866Z triton_flex_attention_437 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3144986Z triton_flex_attention_440 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.3146109Z triton_flex_attention_438 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3146355Z SingleProcess AUTOTUNE benchmarking takes 0.2906 seconds and 1.4377 seconds precompiling for 6 choices 2025-12-04T09:14:58.3146431Z Autotune Choices Stats: 2025-12-04T09:14:58.3147859Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_445", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.3148297Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.3148621Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.3149256Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.3150484Z triton_flex_attention_backward_445 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.3151674Z triton_flex_attention_backward_446 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.3152826Z triton_flex_attention_backward_443 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3153994Z triton_flex_attention_backward_444 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3155147Z triton_flex_attention_backward_447 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3156292Z triton_flex_attention_backward_448 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3157442Z triton_flex_attention_backward_450 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.3158669Z triton_flex_attention_backward_449 0.0135 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.3159863Z triton_flex_attention_backward_452 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3161047Z triton_flex_attention_backward_455 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.3161299Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.3278 seconds precompiling for 13 choices 2025-12-04T09:14:58.3161428Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.3161504Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.3161566Z unimplemented [] 2025-12-04T09:14:58.3161667Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.3162078Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.3163287Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.3163357Z graph_break [] 2025-12-04T09:14:58.3163494Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.3163562Z Autotune Choices Stats: 2025-12-04T09:14:58.3165009Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_460", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:58.3165267Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.3165498Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.3165815Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.3167077Z triton_flex_attention_460 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3168249Z triton_flex_attention_461 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3169426Z triton_flex_attention_459 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.3170544Z triton_flex_attention_456 0.0114 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3171654Z triton_flex_attention_458 0.0123 ms 75.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.3172845Z triton_flex_attention_457 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3173095Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.4513 seconds precompiling for 6 choices 2025-12-04T09:14:58.3173176Z Autotune Choices Stats: 2025-12-04T09:14:58.3174606Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_463", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.3175089Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.3175485Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.3176056Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.3177216Z triton_flex_attention_backward_463 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3178418Z triton_flex_attention_backward_465 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.3179570Z triton_flex_attention_backward_462 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3180728Z triton_flex_attention_backward_464 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.3181871Z triton_flex_attention_backward_467 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3183022Z triton_flex_attention_backward_468 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.3184239Z triton_flex_attention_backward_469 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.3185413Z triton_flex_attention_backward_466 0.0133 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3186612Z triton_flex_attention_backward_471 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3187756Z triton_flex_attention_backward_470 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3188010Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.2662 seconds precompiling for 13 choices 2025-12-04T09:14:58.3188143Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.3188220Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.3188284Z unimplemented [] 2025-12-04T09:14:58.3188389Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.3188578Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.3189752Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.3189833Z graph_break [] 2025-12-04T09:14:58.3189962Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.3190034Z Autotune Choices Stats: 2025-12-04T09:14:58.3191433Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_479", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.007968000136315823, "best_triton_pos": 0} 2025-12-04T09:14:58.3191714Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.3191936Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.3192319Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.3193454Z triton_flex_attention_479 0.0080 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3194599Z triton_flex_attention_480 0.0082 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3195717Z triton_flex_attention_477 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.3196845Z triton_flex_attention_478 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.3197996Z triton_flex_attention_475 0.0113 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3199115Z triton_flex_attention_476 0.0133 ms 59.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3199358Z SingleProcess AUTOTUNE benchmarking takes 0.2917 seconds and 1.3266 seconds precompiling for 6 choices 2025-12-04T09:14:58.3199437Z Autotune Choices Stats: 2025-12-04T09:14:58.3200896Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_481", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.3201397Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.3201725Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.3202291Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.3203492Z triton_flex_attention_backward_481 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3204651Z triton_flex_attention_backward_482 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3205802Z triton_flex_attention_backward_483 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.3206958Z triton_flex_attention_backward_484 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.3208111Z triton_flex_attention_backward_485 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3209295Z triton_flex_attention_backward_486 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3210506Z triton_flex_attention_backward_487 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.3211648Z triton_flex_attention_backward_488 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.3212905Z triton_flex_attention_backward_490 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3214059Z triton_flex_attention_backward_493 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.3214310Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2475 seconds precompiling for 13 choices 2025-12-04T09:14:58.3214438Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.3214520Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.3214583Z unimplemented [] 2025-12-04T09:14:58.3214686Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.3214877Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.3216053Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.3216124Z graph_break [] 2025-12-04T09:14:58.3216252Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.3216319Z Autotune Choices Stats: 2025-12-04T09:14:58.3217751Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_498", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.3218073Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.3218297Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.3218614Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.3219742Z triton_flex_attention_498 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3220884Z triton_flex_attention_499 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3222006Z triton_flex_attention_494 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3223129Z triton_flex_attention_496 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.3224240Z triton_flex_attention_497 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.3225358Z triton_flex_attention_495 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3225600Z SingleProcess AUTOTUNE benchmarking takes 0.2916 seconds and 1.3152 seconds precompiling for 6 choices 2025-12-04T09:14:58.3229787Z Autotune Choices Stats: 2025-12-04T09:14:58.3231385Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_500", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.3231894Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.3232300Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.3232910Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.3234092Z triton_flex_attention_backward_500 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3235252Z triton_flex_attention_backward_501 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3236403Z triton_flex_attention_backward_502 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.3237555Z triton_flex_attention_backward_503 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.3238712Z triton_flex_attention_backward_505 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3239934Z triton_flex_attention_backward_507 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.3241114Z triton_flex_attention_backward_506 0.0143 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.3242292Z triton_flex_attention_backward_504 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3243438Z triton_flex_attention_backward_509 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3244575Z triton_flex_attention_backward_512 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.3244838Z SingleProcess AUTOTUNE benchmarking takes 0.6641 seconds and 2.2733 seconds precompiling for 13 choices 2025-12-04T09:14:58.3245020Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:14:58.3245116Z Traceback (most recent call last): 2025-12-04T09:14:58.3245425Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:14:58.3245496Z self.assertTrue( 2025-12-04T09:14:58.3245700Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:14:58.3245783Z raise self.failureException(msg) 2025-12-04T09:14:58.3246025Z AssertionError: False is not true : Log file /tmp/tmpkvey3fd3/flex_attention_configs.json was not created 2025-12-04T09:14:58.3246037Z 2025-12-04T09:14:58.3246172Z To execute this test, run the following from the base repo dir: 2025-12-04T09:14:58.3246422Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:14:58.3246427Z 2025-12-04T09:14:58.3246598Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:14:58.3246771Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.3246851Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.3246955Z unimplemented [] 2025-12-04T09:14:58.3247094Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.3248291Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('async_compile_cache_miss', 23), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:14:58.3248475Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.3248535Z graph_break [] 2025-12-04T09:14:58.3248674Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.3249692Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:14:58.3249782Z current_size = base.storage().size() 2025-12-04T09:14:58.3249848Z Autotune Choices Stats: 2025-12-04T09:14:58.3251245Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:58.3251494Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.3251717Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.3252031Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.3253244Z triton_flex_attention_4 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3254353Z triton_flex_attention_5 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3255453Z triton_flex_attention_0 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3256677Z triton_flex_attention_2 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.3257778Z triton_flex_attention_3 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.3258910Z triton_flex_attention_1 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3259162Z SingleProcess AUTOTUNE benchmarking takes 0.2600 seconds and 1.4244 seconds precompiling for 6 choices 2025-12-04T09:14:58.3259230Z Autotune Choices Stats: 2025-12-04T09:14:58.3260658Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:58.3261093Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.3261423Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.3262306Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.3263511Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3264761Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.3266054Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3267264Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.3268479Z triton_flex_attention_backward_11 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3269640Z triton_flex_attention_backward_12 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.3270791Z triton_flex_attention_backward_13 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.3271946Z triton_flex_attention_backward_10 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3273087Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3274313Z triton_flex_attention_backward_15 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3274602Z SingleProcess AUTOTUNE benchmarking takes 0.8717 seconds and 2.2315 seconds precompiling for 13 choices 2025-12-04T09:14:58.3274738Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.3274811Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.3274882Z unimplemented [] 2025-12-04T09:14:58.3275003Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.3275192Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.3276414Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.3276477Z graph_break [] 2025-12-04T09:14:58.3276612Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.3276680Z Autotune Choices Stats: 2025-12-04T09:14:58.3278087Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.3278335Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.3278567Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.3278884Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.3280017Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3281141Z triton_flex_attention_24 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3282326Z triton_flex_attention_19 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3283472Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.3284628Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.3285744Z triton_flex_attention_20 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3285997Z SingleProcess AUTOTUNE benchmarking takes 0.2898 seconds and 1.2778 seconds precompiling for 6 choices 2025-12-04T09:14:58.3286066Z Autotune Choices Stats: 2025-12-04T09:14:58.3287499Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.3287935Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.3288281Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.3288844Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.3290014Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3291328Z triton_flex_attention_backward_27 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.3292571Z triton_flex_attention_backward_28 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.3293802Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3294964Z triton_flex_attention_backward_30 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3296118Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3297292Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.3298458Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.3299675Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3300875Z triton_flex_attention_backward_37 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.3301135Z SingleProcess AUTOTUNE benchmarking takes 0.6605 seconds and 2.3199 seconds precompiling for 13 choices 2025-12-04T09:14:58.3301270Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.3301339Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.3301442Z unimplemented [] 2025-12-04T09:14:58.3301549Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.3301735Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.3302922Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.3302983Z graph_break [] 2025-12-04T09:14:58.3303117Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.3303184Z Autotune Choices Stats: 2025-12-04T09:14:58.3304628Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.3304883Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.3305107Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.3305431Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.3306563Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3307751Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3308902Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.3310082Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.3311209Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3312331Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3312586Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2616 seconds precompiling for 6 choices 2025-12-04T09:14:58.3312655Z Autotune Choices Stats: 2025-12-04T09:14:58.3314099Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.3314539Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.3314870Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.3315432Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.3316669Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3317864Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.3319077Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.3320221Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3321375Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3322522Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3323667Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.3324843Z triton_flex_attention_backward_50 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.3326065Z triton_flex_attention_backward_53 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3327244Z triton_flex_attention_backward_56 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.3327493Z SingleProcess AUTOTUNE benchmarking takes 0.6631 seconds and 2.2635 seconds precompiling for 13 choices 2025-12-04T09:14:58.3327631Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.3327702Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.3327769Z unimplemented [] 2025-12-04T09:14:58.3327875Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.3328062Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.3329261Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.3329322Z graph_break [] 2025-12-04T09:14:58.3329457Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.3329524Z Autotune Choices Stats: 2025-12-04T09:14:58.3330933Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.3331184Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.3331400Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.3331720Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.3332933Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3334118Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3335254Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.3336405Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.3337526Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3338643Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3338896Z SingleProcess AUTOTUNE benchmarking takes 0.2907 seconds and 1.2500 seconds precompiling for 6 choices 2025-12-04T09:14:58.3338964Z Autotune Choices Stats: 2025-12-04T09:14:58.3340397Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.3340830Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.3341195Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.3341820Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.3342992Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3344177Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3345332Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.3346484Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.3347669Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3348825Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3349968Z triton_flex_attention_backward_69 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.3351238Z triton_flex_attention_backward_70 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.3352384Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3353568Z triton_flex_attention_backward_75 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.3353818Z SingleProcess AUTOTUNE benchmarking takes 0.6601 seconds and 2.1469 seconds precompiling for 13 choices 2025-12-04T09:14:58.3353957Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.3354029Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.3354097Z unimplemented [] 2025-12-04T09:14:58.3354201Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.3354390Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.3355577Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.3355639Z graph_break [] 2025-12-04T09:14:58.3355772Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.3355838Z Autotune Choices Stats: 2025-12-04T09:14:58.3357224Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.3357477Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.3357695Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.3358076Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.3359236Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3360403Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3361559Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.3362865Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.3363987Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3365095Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3365351Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2576 seconds precompiling for 6 choices 2025-12-04T09:14:58.3365420Z Autotune Choices Stats: 2025-12-04T09:14:58.3366849Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_84", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.3367367Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.3367808Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.3368369Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.3369549Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.3370743Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.3371898Z triton_flex_attention_backward_83 0.0122 ms 92.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3373100Z triton_flex_attention_backward_82 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3374263Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3375416Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3376632Z triton_flex_attention_backward_88 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.3377814Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.3378997Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3380186Z triton_flex_attention_backward_91 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3380433Z SingleProcess AUTOTUNE benchmarking takes 0.6599 seconds and 2.2799 seconds precompiling for 13 choices 2025-12-04T09:14:58.3380574Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.3380640Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.3380706Z unimplemented [] 2025-12-04T09:14:58.3380810Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.3380995Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.3382181Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.3382246Z graph_break [] 2025-12-04T09:14:58.3382381Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.3382449Z Autotune Choices Stats: 2025-12-04T09:14:58.3383842Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:58.3384130Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.3384430Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.3384751Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.3385878Z triton_flex_attention_99 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3387039Z triton_flex_attention_100 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3388149Z triton_flex_attention_95 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3389273Z triton_flex_attention_98 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.3390391Z triton_flex_attention_97 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.3391507Z triton_flex_attention_96 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3391763Z SingleProcess AUTOTUNE benchmarking takes 0.2905 seconds and 1.2994 seconds precompiling for 6 choices 2025-12-04T09:14:58.3391828Z Autotune Choices Stats: 2025-12-04T09:14:58.3393504Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_102", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.3394004Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.3394339Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.3394898Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.3396112Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3397274Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.3398438Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.3399587Z triton_flex_attention_backward_101 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3400749Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3401939Z triton_flex_attention_backward_107 0.0133 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.3403145Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3404337Z triton_flex_attention_backward_108 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.3405484Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3406643Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3406892Z SingleProcess AUTOTUNE benchmarking takes 0.6615 seconds and 2.2434 seconds precompiling for 13 choices 2025-12-04T09:14:58.3407030Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.3407102Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.3407165Z unimplemented [] 2025-12-04T09:14:58.3407278Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.3407467Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.3408648Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.3408718Z graph_break [] 2025-12-04T09:14:58.3408851Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.3408918Z Autotune Choices Stats: 2025-12-04T09:14:58.3410351Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.3410666Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.3410885Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.3411208Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.3412448Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3413577Z triton_flex_attention_119 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3414700Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.3415827Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.3416952Z triton_flex_attention_114 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3418063Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3418367Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2587 seconds precompiling for 6 choices 2025-12-04T09:14:58.3418438Z Autotune Choices Stats: 2025-12-04T09:14:58.3419913Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.3420377Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.3420710Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.3421307Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.3422483Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3423640Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3424806Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.3425965Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.3427129Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3428418Z triton_flex_attention_backward_124 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3429574Z triton_flex_attention_backward_126 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.3430765Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.3431913Z triton_flex_attention_backward_129 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3433077Z triton_flex_attention_backward_132 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.3433320Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.3330 seconds precompiling for 13 choices 2025-12-04T09:14:58.3433459Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.3433528Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.3433591Z unimplemented [] 2025-12-04T09:14:58.3433704Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.3433892Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.3435073Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.3435167Z graph_break [] 2025-12-04T09:14:58.3435297Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.3435364Z Autotune Choices Stats: 2025-12-04T09:14:58.3436804Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:58.3437092Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.3437314Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.3437636Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.3438798Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3439918Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3441038Z triton_flex_attention_136 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.3442156Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.3443273Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3444426Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3444742Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3488 seconds precompiling for 6 choices 2025-12-04T09:14:58.3444808Z Autotune Choices Stats: 2025-12-04T09:14:58.3446228Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_140", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.3446757Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.3447085Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.3447646Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.3448825Z triton_flex_attention_backward_140 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3449990Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.3451150Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3452340Z triton_flex_attention_backward_142 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.3453559Z triton_flex_attention_backward_144 0.0131 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3454738Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3455927Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.3457086Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.3458236Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3459392Z triton_flex_attention_backward_149 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.3459637Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.2392 seconds precompiling for 13 choices 2025-12-04T09:14:58.3459770Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.3459836Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.3459899Z unimplemented [] 2025-12-04T09:14:58.3460010Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.3460191Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.3461403Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.3461525Z graph_break [] 2025-12-04T09:14:58.3461660Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.3461739Z Autotune Choices Stats: 2025-12-04T09:14:58.3463289Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.3463545Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.3463838Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.3464170Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.3465317Z triton_flex_attention_156 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3466443Z triton_flex_attention_157 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3467560Z triton_flex_attention_152 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3468694Z triton_flex_attention_154 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.3469813Z triton_flex_attention_155 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.3471068Z triton_flex_attention_153 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3471318Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3309 seconds precompiling for 6 choices 2025-12-04T09:14:58.3471383Z Autotune Choices Stats: 2025-12-04T09:14:58.3472839Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_160", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.3473282Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.3473605Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.3474173Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.3475344Z triton_flex_attention_backward_160 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.3476521Z triton_flex_attention_backward_158 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3477683Z triton_flex_attention_backward_159 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3478903Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.3480105Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3481294Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3482447Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.3483606Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.3484760Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3485920Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.3486166Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.1546 seconds precompiling for 13 choices 2025-12-04T09:14:58.3486304Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.3486375Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.3486473Z unimplemented [] 2025-12-04T09:14:58.3486582Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.3486765Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.3488016Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.3488079Z graph_break [] 2025-12-04T09:14:58.3488221Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.3488302Z Autotune Choices Stats: 2025-12-04T09:14:58.3489741Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.3489994Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.3490212Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.3490534Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.3491668Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3492851Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3493970Z triton_flex_attention_171 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3495089Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.3496290Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.3497435Z triton_flex_attention_172 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3497685Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3199 seconds precompiling for 6 choices 2025-12-04T09:14:58.3497754Z Autotune Choices Stats: 2025-12-04T09:14:58.3499221Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_178", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.3499668Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.3499994Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.3500561Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.3501764Z triton_flex_attention_backward_178 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3502930Z triton_flex_attention_backward_177 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3504118Z triton_flex_attention_backward_180 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.3505364Z triton_flex_attention_backward_179 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.3506519Z triton_flex_attention_backward_182 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3507700Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3508885Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.3510039Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.3511190Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3512354Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.3512636Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3449 seconds precompiling for 13 choices 2025-12-04T09:14:58.3512821Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.3512923Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.3512990Z unimplemented [] 2025-12-04T09:14:58.3513099Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.3513282Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.3514467Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.3514529Z graph_break [] 2025-12-04T09:14:58.3514654Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.3514731Z Autotune Choices Stats: 2025-12-04T09:14:58.3516158Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:58.3516421Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.3516637Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.3516963Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.3518089Z triton_flex_attention_194 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3519224Z triton_flex_attention_195 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3520333Z triton_flex_attention_192 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.3521518Z triton_flex_attention_193 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.3522661Z triton_flex_attention_190 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3523816Z triton_flex_attention_191 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3524070Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3094 seconds precompiling for 6 choices 2025-12-04T09:14:58.3524147Z Autotune Choices Stats: 2025-12-04T09:14:58.3525573Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.3526014Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.3526338Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.3526906Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.3528073Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3529244Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.3530463Z triton_flex_attention_backward_197 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3531680Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.3532948Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3534103Z triton_flex_attention_backward_200 0.0124 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3535264Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.3536422Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.3537570Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.3538756Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3539070Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.3930 seconds precompiling for 13 choices 2025-12-04T09:14:58.3539203Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.3539271Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.3539333Z unimplemented [] 2025-12-04T09:14:58.3539441Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.3539622Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.3540844Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.3540907Z graph_break [] 2025-12-04T09:14:58.3541044Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.3541120Z Autotune Choices Stats: 2025-12-04T09:14:58.3542525Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008063999935984612, "best_triton_pos": 0} 2025-12-04T09:14:58.3542779Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.3542997Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.3543317Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.3544445Z triton_flex_attention_213 0.0081 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3545566Z triton_flex_attention_214 0.0082 ms 98.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3546731Z triton_flex_attention_211 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.3547924Z triton_flex_attention_212 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.3549043Z triton_flex_attention_209 0.0112 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3550197Z triton_flex_attention_210 0.0133 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3550442Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.3148 seconds precompiling for 6 choices 2025-12-04T09:14:58.3550510Z Autotune Choices Stats: 2025-12-04T09:14:58.3551940Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.3552378Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.3552747Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.3553316Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.3554477Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3555708Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3556899Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.3558127Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.3559287Z triton_flex_attention_backward_219 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3560436Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3561590Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.3562902Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.3564051Z triton_flex_attention_backward_224 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3565377Z triton_flex_attention_backward_227 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.3565623Z SingleProcess AUTOTUNE benchmarking takes 0.6619 seconds and 2.3746 seconds precompiling for 13 choices 2025-12-04T09:14:58.3565756Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.3565823Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.3565884Z unimplemented [] 2025-12-04T09:14:58.3565989Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.3566173Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.3567404Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.3567466Z graph_break [] 2025-12-04T09:14:58.3567590Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.3567664Z Autotune Choices Stats: 2025-12-04T09:14:58.3569062Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.3569311Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.3569525Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.3569842Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.3570979Z triton_flex_attention_232 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3572106Z triton_flex_attention_233 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3573397Z triton_flex_attention_228 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3574520Z triton_flex_attention_230 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.3575684Z triton_flex_attention_231 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.3576803Z triton_flex_attention_229 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3577053Z SingleProcess AUTOTUNE benchmarking takes 0.2900 seconds and 1.2906 seconds precompiling for 6 choices 2025-12-04T09:14:58.3577121Z Autotune Choices Stats: 2025-12-04T09:14:58.3578554Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.3578996Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.3579325Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.3579888Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.3581088Z triton_flex_attention_backward_234 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3582310Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3583501Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.3584654Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.3585813Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3586962Z triton_flex_attention_backward_240 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.3588119Z triton_flex_attention_backward_241 0.0134 ms 91.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.3589278Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3590483Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3591672Z triton_flex_attention_backward_246 0.0153 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.3593206Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2768 seconds precompiling for 13 choices 2025-12-04T09:14:58.3593700Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.3593985Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.3594181Z unimplemented [] 2025-12-04T09:14:58.3594382Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.3594746Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.3596185Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.3597499Z graph_break [] 2025-12-04T09:14:58.3597743Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.3598016Z Autotune Choices Stats: 2025-12-04T09:14:58.3599538Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.3601234Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.3601784Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.3602402Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.3603929Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3606321Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3608662Z triton_flex_attention_250 0.0111 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.3611012Z triton_flex_attention_247 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3613385Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.3615699Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3617148Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2952 seconds precompiling for 6 choices 2025-12-04T09:14:58.3617536Z Autotune Choices Stats: 2025-12-04T09:14:58.3619087Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_255", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.3621009Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.3621846Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.3622860Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.3624749Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.3627131Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3629545Z triton_flex_attention_backward_254 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3631915Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.3634284Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3636655Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3639018Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.3641452Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.3643867Z triton_flex_attention_backward_262 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3646268Z triton_flex_attention_backward_265 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.3647740Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.3139 seconds precompiling for 13 choices 2025-12-04T09:14:58.3648226Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.3648507Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.3648693Z unimplemented [] 2025-12-04T09:14:58.3648910Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.3649276Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.3650717Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.3652019Z graph_break [] 2025-12-04T09:14:58.3652242Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.3652600Z Autotune Choices Stats: 2025-12-04T09:14:58.3654103Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.3655803Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.3656337Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.3656950Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.3658563Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3660904Z triton_flex_attention_271 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3663469Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3665790Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.3668120Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.3670439Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3671872Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T09:14:58.3672258Z Autotune Choices Stats: 2025-12-04T09:14:58.3673807Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.3675729Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.3676664Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.3677674Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.3679489Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3681908Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3684292Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.3686690Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.3689077Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3691488Z triton_flex_attention_backward_278 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.3693967Z triton_flex_attention_backward_279 0.0136 ms 90.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.3696428Z triton_flex_attention_backward_276 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3698834Z triton_flex_attention_backward_281 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3701205Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.3702671Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.2940 seconds precompiling for 13 choices 2025-12-04T09:14:58.3703119Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.3703404Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.3703600Z unimplemented [] 2025-12-04T09:14:58.3703802Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.3704158Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.3705597Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.3706894Z graph_break [] 2025-12-04T09:14:58.3707117Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.3707392Z Autotune Choices Stats: 2025-12-04T09:14:58.3708899Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.3710593Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.3711177Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.3711823Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.3713376Z triton_flex_attention_289 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3715686Z triton_flex_attention_290 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3718073Z triton_flex_attention_287 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.3720387Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3722691Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.3724999Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3726446Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3003 seconds precompiling for 6 choices 2025-12-04T09:14:58.3726831Z Autotune Choices Stats: 2025-12-04T09:14:58.3728368Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_294", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.012160000391304493, "best_triton_pos": 0} 2025-12-04T09:14:58.3730415Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.3731246Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.3732207Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.3734098Z triton_flex_attention_backward_294 0.0122 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.3736478Z triton_flex_attention_backward_291 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3738853Z triton_flex_attention_backward_292 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3741221Z triton_flex_attention_backward_293 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.3743599Z triton_flex_attention_backward_296 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3745968Z triton_flex_attention_backward_295 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3748446Z triton_flex_attention_backward_297 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.3750819Z triton_flex_attention_backward_298 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.3753233Z triton_flex_attention_backward_300 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3755605Z triton_flex_attention_backward_303 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.3757077Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.2523 seconds precompiling for 13 choices 2025-12-04T09:14:58.3757529Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.3757806Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.3757991Z unimplemented [] 2025-12-04T09:14:58.3758195Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.3758558Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.3759997Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.3761306Z graph_break [] 2025-12-04T09:14:58.3761530Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.3761979Z Autotune Choices Stats: 2025-12-04T09:14:58.3763514Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_308", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.3765435Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.3765980Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.3766586Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.3768112Z triton_flex_attention_308 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3770481Z triton_flex_attention_309 0.0082 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3772847Z triton_flex_attention_307 0.0112 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.3775166Z triton_flex_attention_304 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3777472Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.3779786Z triton_flex_attention_305 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3781212Z SingleProcess AUTOTUNE benchmarking takes 0.2924 seconds and 1.4096 seconds precompiling for 6 choices 2025-12-04T09:14:58.3781599Z Autotune Choices Stats: 2025-12-04T09:14:58.3783216Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.3785174Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.3786006Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.3786966Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.3788844Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3791229Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3793616Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.3796025Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.3798413Z triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3800851Z triton_flex_attention_backward_316 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.3803285Z triton_flex_attention_backward_317 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.3805687Z triton_flex_attention_backward_314 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3808065Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3810442Z triton_flex_attention_backward_318 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3811914Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.3225 seconds precompiling for 13 choices 2025-12-04T09:14:58.3812402Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.3812717Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.3812914Z unimplemented [] 2025-12-04T09:14:58.3813113Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.3813478Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.3814918Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.3816214Z graph_break [] 2025-12-04T09:14:58.3816443Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.3816718Z Autotune Choices Stats: 2025-12-04T09:14:58.3818301Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.007968000136315823, "best_triton_pos": 0} 2025-12-04T09:14:58.3820033Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.3820571Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.3821178Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.3822752Z triton_flex_attention_327 0.0080 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3825055Z triton_flex_attention_328 0.0082 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3827352Z triton_flex_attention_325 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.3829652Z triton_flex_attention_326 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.3831956Z triton_flex_attention_323 0.0113 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3834267Z triton_flex_attention_324 0.0133 ms 59.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3835737Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T09:14:58.3836185Z Autotune Choices Stats: 2025-12-04T09:14:58.3837727Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.3839655Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.3840527Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.3841488Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.3843332Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3845717Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3848103Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.3850500Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.3852968Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3855409Z triton_flex_attention_backward_336 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.3857814Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3860211Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.3862746Z triton_flex_attention_backward_338 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3865127Z triton_flex_attention_backward_341 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.3866594Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3668 seconds precompiling for 13 choices 2025-12-04T09:14:58.3867055Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.3867326Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.3867521Z unimplemented [] 2025-12-04T09:14:58.3867728Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.3868104Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.3869535Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.3870915Z graph_break [] 2025-12-04T09:14:58.3871237Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.3871520Z Autotune Choices Stats: 2025-12-04T09:14:58.3873020Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.3874717Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.3875259Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.3875924Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.3877443Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3879755Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3882054Z triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3884360Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.3886661Z triton_flex_attention_345 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.3889078Z triton_flex_attention_343 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3890593Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2820 seconds precompiling for 6 choices 2025-12-04T09:14:58.3890974Z Autotune Choices Stats: 2025-12-04T09:14:58.3892544Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_349", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.3894528Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.3895357Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.3896330Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.3898134Z triton_flex_attention_backward_349 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3900514Z triton_flex_attention_backward_348 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3902890Z triton_flex_attention_backward_350 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.3905261Z triton_flex_attention_backward_351 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.3907729Z triton_flex_attention_backward_353 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3910094Z triton_flex_attention_backward_355 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.3912513Z triton_flex_attention_backward_352 0.0143 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3914883Z triton_flex_attention_backward_354 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.3917269Z triton_flex_attention_backward_357 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3919644Z triton_flex_attention_backward_360 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.3921115Z SingleProcess AUTOTUNE benchmarking takes 0.6637 seconds and 2.2959 seconds precompiling for 13 choices 2025-12-04T09:14:58.3921571Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.3921853Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.3922040Z unimplemented [] 2025-12-04T09:14:58.3922244Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.3922605Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.3924114Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.3925478Z graph_break [] 2025-12-04T09:14:58.3925698Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.3925978Z Autotune Choices Stats: 2025-12-04T09:14:58.3927491Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.3929230Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.3929774Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.3930382Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.3931903Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3934267Z triton_flex_attention_366 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3936581Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.3938884Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3941230Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.3943599Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3945032Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3270 seconds precompiling for 6 choices 2025-12-04T09:14:58.3945432Z Autotune Choices Stats: 2025-12-04T09:14:58.3947009Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_368", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.3948936Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.3949771Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.3950731Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.3952528Z triton_flex_attention_backward_368 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3954911Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3957278Z triton_flex_attention_backward_369 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.3959729Z triton_flex_attention_backward_370 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.3962314Z triton_flex_attention_backward_372 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3964764Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.3967137Z triton_flex_attention_backward_374 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.3969543Z triton_flex_attention_backward_371 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.3971909Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3974342Z triton_flex_attention_backward_379 0.0146 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.3975815Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.4637 seconds precompiling for 13 choices 2025-12-04T09:14:58.3976333Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.3976617Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.3976808Z unimplemented [] 2025-12-04T09:14:58.3977059Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.3977463Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.3978907Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.3980205Z graph_break [] 2025-12-04T09:14:58.3980429Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.3980698Z Autotune Choices Stats: 2025-12-04T09:14:58.3982229Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_384", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.3983931Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.3984467Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.3985076Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.3986607Z triton_flex_attention_384 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3988928Z triton_flex_attention_385 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3991238Z triton_flex_attention_380 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3993533Z triton_flex_attention_382 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.3995933Z triton_flex_attention_383 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.3998231Z triton_flex_attention_381 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.3999695Z SingleProcess AUTOTUNE benchmarking takes 0.2918 seconds and 1.3242 seconds precompiling for 6 choices 2025-12-04T09:14:58.4000073Z Autotune Choices Stats: 2025-12-04T09:14:58.4001612Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_386", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.4003540Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4004373Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4005339Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.4007150Z triton_flex_attention_backward_386 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4009535Z triton_flex_attention_backward_387 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4012032Z triton_flex_attention_backward_388 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4014491Z triton_flex_attention_backward_389 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4016903Z triton_flex_attention_backward_391 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4019285Z triton_flex_attention_backward_392 0.0134 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4021656Z triton_flex_attention_backward_390 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4024016Z triton_flex_attention_backward_393 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4026385Z triton_flex_attention_backward_395 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4028754Z triton_flex_attention_backward_394 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4030333Z SingleProcess AUTOTUNE benchmarking takes 0.6645 seconds and 2.2524 seconds precompiling for 13 choices 2025-12-04T09:14:58.4030783Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.4031062Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.4031248Z unimplemented [] 2025-12-04T09:14:58.4031447Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.4031807Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.4033241Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.4034590Z graph_break [] 2025-12-04T09:14:58.4034855Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.4035131Z Autotune Choices Stats: 2025-12-04T09:14:58.4036626Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_403", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.4038311Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4038843Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4039452Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.4040973Z triton_flex_attention_403 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4043277Z triton_flex_attention_404 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4045588Z triton_flex_attention_399 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4047965Z triton_flex_attention_401 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.4050322Z triton_flex_attention_402 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.4052737Z triton_flex_attention_400 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4054175Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3374 seconds precompiling for 6 choices 2025-12-04T09:14:58.4054562Z Autotune Choices Stats: 2025-12-04T09:14:58.4056099Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_406", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.4058026Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4058848Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4059807Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.4061612Z triton_flex_attention_backward_406 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4064232Z triton_flex_attention_backward_405 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4066712Z triton_flex_attention_backward_407 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4069135Z triton_flex_attention_backward_408 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4071548Z triton_flex_attention_backward_410 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4073918Z triton_flex_attention_backward_409 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4076299Z triton_flex_attention_backward_411 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4078675Z triton_flex_attention_backward_412 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4081057Z triton_flex_attention_backward_413 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4083539Z triton_flex_attention_backward_414 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4085042Z SingleProcess AUTOTUNE benchmarking takes 0.6644 seconds and 2.4031 seconds precompiling for 13 choices 2025-12-04T09:14:58.4085500Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.4085775Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.4085983Z unimplemented [] 2025-12-04T09:14:58.4086182Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.4086544Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.4088035Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.4089357Z graph_break [] 2025-12-04T09:14:58.4089576Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.4089849Z Autotune Choices Stats: 2025-12-04T09:14:58.4091351Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_422", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.4093100Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4093632Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4094232Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.4095762Z triton_flex_attention_422 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4098061Z triton_flex_attention_423 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4100738Z triton_flex_attention_420 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.4103085Z triton_flex_attention_421 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.4105412Z triton_flex_attention_418 0.0123 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4107713Z triton_flex_attention_419 0.0142 ms 57.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4109145Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3073 seconds precompiling for 6 choices 2025-12-04T09:14:58.4109528Z Autotune Choices Stats: 2025-12-04T09:14:58.4111073Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_426", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012256000190973282, "best_triton_pos": 0} 2025-12-04T09:14:58.4112989Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4113821Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4114783Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.4116580Z triton_flex_attention_backward_426 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4119069Z triton_flex_attention_backward_424 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4121437Z triton_flex_attention_backward_425 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4123841Z triton_flex_attention_backward_427 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4126205Z triton_flex_attention_backward_429 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4128573Z triton_flex_attention_backward_431 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4130952Z triton_flex_attention_backward_430 0.0141 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4133373Z triton_flex_attention_backward_428 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4135812Z triton_flex_attention_backward_433 0.0143 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4138219Z triton_flex_attention_backward_436 0.0144 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.4139692Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.3392 seconds precompiling for 13 choices 2025-12-04T09:14:58.4140144Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.4140432Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.4140627Z unimplemented [] 2025-12-04T09:14:58.4140870Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.4141231Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.4142672Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.4143970Z graph_break [] 2025-12-04T09:14:58.4144193Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.4144466Z Autotune Choices Stats: 2025-12-04T09:14:58.4145974Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_441", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.4147685Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4148223Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4148835Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.4150359Z triton_flex_attention_441 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4152699Z triton_flex_attention_442 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4155087Z triton_flex_attention_439 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.4157406Z triton_flex_attention_437 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4159705Z triton_flex_attention_440 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.4162168Z triton_flex_attention_438 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4163609Z SingleProcess AUTOTUNE benchmarking takes 0.2906 seconds and 1.4377 seconds precompiling for 6 choices 2025-12-04T09:14:58.4163986Z Autotune Choices Stats: 2025-12-04T09:14:58.4165522Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_445", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.4167445Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4168275Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4169236Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.4171167Z triton_flex_attention_backward_445 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4173647Z triton_flex_attention_backward_446 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4176066Z triton_flex_attention_backward_443 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4178433Z triton_flex_attention_backward_444 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4180804Z triton_flex_attention_backward_447 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4183171Z triton_flex_attention_backward_448 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4185541Z triton_flex_attention_backward_450 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4187935Z triton_flex_attention_backward_449 0.0135 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4190376Z triton_flex_attention_backward_452 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4192766Z triton_flex_attention_backward_455 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.4194266Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.3278 seconds precompiling for 13 choices 2025-12-04T09:14:58.4194718Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.4194998Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.4195186Z unimplemented [] 2025-12-04T09:14:58.4195386Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.4195748Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.4197190Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.4198502Z graph_break [] 2025-12-04T09:14:58.4198724Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.4198997Z Autotune Choices Stats: 2025-12-04T09:14:58.4205232Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_460", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:58.4206990Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4207540Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4208160Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.4209703Z triton_flex_attention_460 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4212211Z triton_flex_attention_461 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4214648Z triton_flex_attention_459 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.4217004Z triton_flex_attention_456 0.0114 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4219310Z triton_flex_attention_458 0.0123 ms 75.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.4221619Z triton_flex_attention_457 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4223053Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.4513 seconds precompiling for 6 choices 2025-12-04T09:14:58.4223463Z Autotune Choices Stats: 2025-12-04T09:14:58.4225027Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_463", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.4226954Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4227783Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4228844Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.4230725Z triton_flex_attention_backward_463 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4233160Z triton_flex_attention_backward_465 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4235547Z triton_flex_attention_backward_462 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4237919Z triton_flex_attention_backward_464 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4240297Z triton_flex_attention_backward_467 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4242670Z triton_flex_attention_backward_468 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4245033Z triton_flex_attention_backward_469 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4247481Z triton_flex_attention_backward_466 0.0133 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4249882Z triton_flex_attention_backward_471 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4252333Z triton_flex_attention_backward_470 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4253848Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.2662 seconds precompiling for 13 choices 2025-12-04T09:14:58.4256092Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.4256393Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.4256590Z unimplemented [] 2025-12-04T09:14:58.4256815Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.4257193Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.4258645Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.4259941Z graph_break [] 2025-12-04T09:14:58.4260174Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.4260482Z Autotune Choices Stats: 2025-12-04T09:14:58.4262232Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_479", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.007968000136315823, "best_triton_pos": 0} 2025-12-04T09:14:58.4263963Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4264512Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4265129Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.4266749Z triton_flex_attention_479 0.0080 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4269089Z triton_flex_attention_480 0.0082 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4271416Z triton_flex_attention_477 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.4273722Z triton_flex_attention_478 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.4276098Z triton_flex_attention_475 0.0113 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4278394Z triton_flex_attention_476 0.0133 ms 59.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4279835Z SingleProcess AUTOTUNE benchmarking takes 0.2917 seconds and 1.3266 seconds precompiling for 6 choices 2025-12-04T09:14:58.4280220Z Autotune Choices Stats: 2025-12-04T09:14:58.4281773Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_481", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.4283704Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4284623Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4285581Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.4287382Z triton_flex_attention_backward_481 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4289805Z triton_flex_attention_backward_482 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4292185Z triton_flex_attention_backward_483 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4294688Z triton_flex_attention_backward_484 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4297052Z triton_flex_attention_backward_485 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4299416Z triton_flex_attention_backward_486 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4301811Z triton_flex_attention_backward_487 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4304188Z triton_flex_attention_backward_488 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4306573Z triton_flex_attention_backward_490 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4308941Z triton_flex_attention_backward_493 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.4310451Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2475 seconds precompiling for 13 choices 2025-12-04T09:14:58.4310916Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.4311204Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.4311393Z unimplemented [] 2025-12-04T09:14:58.4311600Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.4311965Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.4313403Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.4314705Z graph_break [] 2025-12-04T09:14:58.4314933Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.4315213Z Autotune Choices Stats: 2025-12-04T09:14:58.4316724Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_498", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.4318430Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4319013Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4319664Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.4321189Z triton_flex_attention_498 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4323518Z triton_flex_attention_499 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4325800Z triton_flex_attention_494 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4328141Z triton_flex_attention_496 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.4330446Z triton_flex_attention_497 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.4332812Z triton_flex_attention_495 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4334240Z SingleProcess AUTOTUNE benchmarking takes 0.2916 seconds and 1.3152 seconds precompiling for 6 choices 2025-12-04T09:14:58.4334622Z Autotune Choices Stats: 2025-12-04T09:14:58.4336206Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_500", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.4338168Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4339013Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4339974Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.4341812Z triton_flex_attention_backward_500 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4344199Z triton_flex_attention_backward_501 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4346608Z triton_flex_attention_backward_502 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4348984Z triton_flex_attention_backward_503 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4351369Z triton_flex_attention_backward_505 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4353769Z triton_flex_attention_backward_507 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4356171Z triton_flex_attention_backward_506 0.0143 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4357368Z triton_flex_attention_backward_504 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4358520Z triton_flex_attention_backward_509 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4359741Z triton_flex_attention_backward_512 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.4359996Z SingleProcess AUTOTUNE benchmarking takes 0.6641 seconds and 2.2733 seconds precompiling for 13 choices 2025-12-04T09:14:58.4360129Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.4360208Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.4360269Z unimplemented [] 2025-12-04T09:14:58.4360373Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.4360567Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.4361743Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 27), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 8), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 5), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.4362028Z graph_break [] 2025-12-04T09:14:58.4362174Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.4362242Z Autotune Choices Stats: 2025-12-04T09:14:58.4363732Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_517", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.4364031Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4364257Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4364578Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.4365728Z triton_flex_attention_517 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4366891Z triton_flex_attention_518 0.0084 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4368062Z triton_flex_attention_513 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4369188Z triton_flex_attention_515 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.4370315Z triton_flex_attention_516 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.4371441Z triton_flex_attention_514 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4371691Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3311 seconds precompiling for 6 choices 2025-12-04T09:14:58.4371759Z Autotune Choices Stats: 2025-12-04T09:14:58.4373290Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_519", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.4373764Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4374093Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4374689Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.4375864Z triton_flex_attention_backward_519 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4377066Z triton_flex_attention_backward_520 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4378223Z triton_flex_attention_backward_521 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4379388Z triton_flex_attention_backward_522 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4380537Z triton_flex_attention_backward_524 0.0132 ms 93.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4381764Z triton_flex_attention_backward_523 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4382933Z triton_flex_attention_backward_525 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4384121Z triton_flex_attention_backward_526 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4385281Z triton_flex_attention_backward_528 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4386463Z triton_flex_attention_backward_531 0.0153 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.4386721Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.2909 seconds precompiling for 13 choices 2025-12-04T09:14:58.4386895Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:14:58.4386978Z Traceback (most recent call last): 2025-12-04T09:14:58.4387285Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:14:58.4387350Z self.assertTrue( 2025-12-04T09:14:58.4387558Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:14:58.4387640Z raise self.failureException(msg) 2025-12-04T09:14:58.4387878Z AssertionError: False is not true : Log file /tmp/tmpsch_rjii/flex_attention_configs.json was not created 2025-12-04T09:14:58.4387884Z 2025-12-04T09:14:58.4388024Z To execute this test, run the following from the base repo dir: 2025-12-04T09:14:58.4388276Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:14:58.4388282Z 2025-12-04T09:14:58.4388451Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:14:58.4388585Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.4388697Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.4388766Z unimplemented [] 2025-12-04T09:14:58.4388932Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.4390137Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('async_compile_cache_miss', 23), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:14:58.4390324Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.4390383Z graph_break [] 2025-12-04T09:14:58.4390520Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.4391547Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:14:58.4391636Z current_size = base.storage().size() 2025-12-04T09:14:58.4391702Z Autotune Choices Stats: 2025-12-04T09:14:58.4393105Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:58.4393392Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4393612Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4393933Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.4395060Z triton_flex_attention_4 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4396177Z triton_flex_attention_5 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4397321Z triton_flex_attention_0 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4398463Z triton_flex_attention_2 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.4399585Z triton_flex_attention_3 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.4400728Z triton_flex_attention_1 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4400982Z SingleProcess AUTOTUNE benchmarking takes 0.2600 seconds and 1.4244 seconds precompiling for 6 choices 2025-12-04T09:14:58.4401082Z Autotune Choices Stats: 2025-12-04T09:14:58.4402510Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:58.4402947Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4403281Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4403845Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.4405017Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4406211Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4407408Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4408595Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4409751Z triton_flex_attention_backward_11 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4410940Z triton_flex_attention_backward_12 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4412086Z triton_flex_attention_backward_13 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4413275Z triton_flex_attention_backward_10 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4414414Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4415630Z triton_flex_attention_backward_15 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4415875Z SingleProcess AUTOTUNE benchmarking takes 0.8717 seconds and 2.2315 seconds precompiling for 13 choices 2025-12-04T09:14:58.4416012Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.4416080Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.4416146Z unimplemented [] 2025-12-04T09:14:58.4416250Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.4416439Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.4417647Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.4417708Z graph_break [] 2025-12-04T09:14:58.4417853Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.4417958Z Autotune Choices Stats: 2025-12-04T09:14:58.4419351Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.4419595Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4419813Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4420133Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.4421262Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4422381Z triton_flex_attention_24 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4423595Z triton_flex_attention_19 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4424740Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.4425892Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.4427004Z triton_flex_attention_20 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4427303Z SingleProcess AUTOTUNE benchmarking takes 0.2898 seconds and 1.2778 seconds precompiling for 6 choices 2025-12-04T09:14:58.4427369Z Autotune Choices Stats: 2025-12-04T09:14:58.4428817Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.4429253Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4429588Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4430147Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.4431354Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4432542Z triton_flex_attention_backward_27 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4433726Z triton_flex_attention_backward_28 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4434870Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4436058Z triton_flex_attention_backward_30 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4437205Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4438355Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4439515Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4440692Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4441876Z triton_flex_attention_backward_37 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.4442122Z SingleProcess AUTOTUNE benchmarking takes 0.6605 seconds and 2.3199 seconds precompiling for 13 choices 2025-12-04T09:14:58.4442262Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.4442365Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.4442435Z unimplemented [] 2025-12-04T09:14:58.4442541Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.4442727Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.4443907Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.4444002Z graph_break [] 2025-12-04T09:14:58.4444144Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.4444215Z Autotune Choices Stats: 2025-12-04T09:14:58.4445596Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.4445848Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4446068Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4446397Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.4447525Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4448675Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4449819Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.4450972Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.4452088Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4453324Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4453576Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2616 seconds precompiling for 6 choices 2025-12-04T09:14:58.4453643Z Autotune Choices Stats: 2025-12-04T09:14:58.4455067Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.4455499Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4455827Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4456387Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.4457591Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4458777Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4459964Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4461108Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4462457Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4463605Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4464759Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4465970Z triton_flex_attention_backward_50 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4467185Z triton_flex_attention_backward_53 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4468383Z triton_flex_attention_backward_56 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.4468640Z SingleProcess AUTOTUNE benchmarking takes 0.6631 seconds and 2.2635 seconds precompiling for 13 choices 2025-12-04T09:14:58.4468769Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.4468838Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.4468908Z unimplemented [] 2025-12-04T09:14:58.4469010Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.4469240Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.4470424Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.4470487Z graph_break [] 2025-12-04T09:14:58.4470617Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.4470687Z Autotune Choices Stats: 2025-12-04T09:14:58.4472082Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.4472329Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4472550Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4472864Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.4474040Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4475216Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4476342Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.4477496Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.4478648Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4479764Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4480014Z SingleProcess AUTOTUNE benchmarking takes 0.2907 seconds and 1.2500 seconds precompiling for 6 choices 2025-12-04T09:14:58.4480079Z Autotune Choices Stats: 2025-12-04T09:14:58.4481525Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.4481961Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4482291Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4482917Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.4484081Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4485281Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4486445Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4487636Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4488779Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4489940Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4491080Z triton_flex_attention_backward_69 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4492330Z triton_flex_attention_backward_70 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4493479Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4494697Z triton_flex_attention_backward_75 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.4494942Z SingleProcess AUTOTUNE benchmarking takes 0.6601 seconds and 2.1469 seconds precompiling for 13 choices 2025-12-04T09:14:58.4495101Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.4495167Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.4495232Z unimplemented [] 2025-12-04T09:14:58.4495335Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.4495522Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.4496704Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.4496765Z graph_break [] 2025-12-04T09:14:58.4496895Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.4496961Z Autotune Choices Stats: 2025-12-04T09:14:58.4498345Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.4498589Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4498811Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4499134Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.4500342Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4501461Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4502617Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.4503730Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.4504877Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4505987Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4506239Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2576 seconds precompiling for 6 choices 2025-12-04T09:14:58.4506305Z Autotune Choices Stats: 2025-12-04T09:14:58.4507743Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_84", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.4508214Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4508576Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4509144Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.4510364Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4511518Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4512701Z triton_flex_attention_backward_83 0.0122 ms 92.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4513848Z triton_flex_attention_backward_82 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4515000Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4516154Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4517337Z triton_flex_attention_backward_88 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4518540Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4519721Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4520874Z triton_flex_attention_backward_91 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4521157Z SingleProcess AUTOTUNE benchmarking takes 0.6599 seconds and 2.2799 seconds precompiling for 13 choices 2025-12-04T09:14:58.4521289Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.4521355Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.4521422Z unimplemented [] 2025-12-04T09:14:58.4521525Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.4521708Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.4522890Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.4522952Z graph_break [] 2025-12-04T09:14:58.4523084Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.4523149Z Autotune Choices Stats: 2025-12-04T09:14:58.4524536Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:58.4524777Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4525068Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4525397Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.4526521Z triton_flex_attention_99 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4527676Z triton_flex_attention_100 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4528796Z triton_flex_attention_95 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4529943Z triton_flex_attention_98 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.4531065Z triton_flex_attention_97 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.4532180Z triton_flex_attention_96 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4532467Z SingleProcess AUTOTUNE benchmarking takes 0.2905 seconds and 1.2994 seconds precompiling for 6 choices 2025-12-04T09:14:58.4532534Z Autotune Choices Stats: 2025-12-04T09:14:58.4533999Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_102", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.4534466Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4534800Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4535361Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.4536563Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4537714Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4538905Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4540051Z triton_flex_attention_backward_101 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4541209Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4542392Z triton_flex_attention_backward_107 0.0133 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4543564Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4544751Z triton_flex_attention_backward_108 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4545898Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4547083Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4547329Z SingleProcess AUTOTUNE benchmarking takes 0.6615 seconds and 2.2434 seconds precompiling for 13 choices 2025-12-04T09:14:58.4547457Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.4547523Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.4547590Z unimplemented [] 2025-12-04T09:14:58.4547689Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.4547869Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.4549053Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.4549114Z graph_break [] 2025-12-04T09:14:58.4549243Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.4549308Z Autotune Choices Stats: 2025-12-04T09:14:58.4550760Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.4551036Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4551251Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4551581Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.4552749Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4553866Z triton_flex_attention_119 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4555020Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.4556133Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.4557249Z triton_flex_attention_114 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4558357Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4558604Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2587 seconds precompiling for 6 choices 2025-12-04T09:14:58.4558668Z Autotune Choices Stats: 2025-12-04T09:14:58.4560167Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.4560600Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4560930Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4561522Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.4562836Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4564058Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4565213Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4566374Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4567576Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4568786Z triton_flex_attention_backward_124 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4569942Z triton_flex_attention_backward_126 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4571142Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4572357Z triton_flex_attention_backward_129 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4573513Z triton_flex_attention_backward_132 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.4573759Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.3330 seconds precompiling for 13 choices 2025-12-04T09:14:58.4573894Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.4573965Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.4574034Z unimplemented [] 2025-12-04T09:14:58.4574136Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.4574330Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.4575515Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.4575575Z graph_break [] 2025-12-04T09:14:58.4575706Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.4575770Z Autotune Choices Stats: 2025-12-04T09:14:58.4577232Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:58.4577479Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4577696Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4578023Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.4579191Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4580312Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4581481Z triton_flex_attention_136 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.4582591Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.4583709Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4584851Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4585132Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3488 seconds precompiling for 6 choices 2025-12-04T09:14:58.4585199Z Autotune Choices Stats: 2025-12-04T09:14:58.4586632Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_140", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.4587097Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4587430Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4587990Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.4589211Z triton_flex_attention_backward_140 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4590364Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4591523Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4592671Z triton_flex_attention_backward_142 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4593856Z triton_flex_attention_backward_144 0.0131 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4595040Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4596214Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4597366Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4598546Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4599699Z triton_flex_attention_backward_149 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4599943Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.2392 seconds precompiling for 13 choices 2025-12-04T09:14:58.4600076Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.4600143Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.4600212Z unimplemented [] 2025-12-04T09:14:58.4600316Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.4600495Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.4601708Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.4601799Z graph_break [] 2025-12-04T09:14:58.4601931Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.4601996Z Autotune Choices Stats: 2025-12-04T09:14:58.4603395Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.4603642Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4603891Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4604210Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.4605335Z triton_flex_attention_156 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4606482Z triton_flex_attention_157 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4607590Z triton_flex_attention_152 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4608711Z triton_flex_attention_154 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.4609831Z triton_flex_attention_155 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.4611020Z triton_flex_attention_153 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4611269Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3309 seconds precompiling for 6 choices 2025-12-04T09:14:58.4611335Z Autotune Choices Stats: 2025-12-04T09:14:58.4612865Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_160", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.4613305Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4613636Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4614229Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.4615402Z triton_flex_attention_backward_160 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4616556Z triton_flex_attention_backward_158 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4617712Z triton_flex_attention_backward_159 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4618892Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4620080Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4621260Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4622405Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4623587Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4624734Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4625886Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.4626127Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.1546 seconds precompiling for 13 choices 2025-12-04T09:14:58.4626259Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.4626327Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.4626393Z unimplemented [] 2025-12-04T09:14:58.4626494Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.4626673Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.4627965Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.4628025Z graph_break [] 2025-12-04T09:14:58.4628154Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.4628221Z Autotune Choices Stats: 2025-12-04T09:14:58.4629651Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.4629897Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4630113Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4630465Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.4631590Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4632701Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4633815Z triton_flex_attention_171 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4634931Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.4636082Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.4637218Z triton_flex_attention_172 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4637465Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3199 seconds precompiling for 6 choices 2025-12-04T09:14:58.4637531Z Autotune Choices Stats: 2025-12-04T09:14:58.4639001Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_178", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.4639463Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4639797Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4640353Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.4641524Z triton_flex_attention_backward_178 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4642671Z triton_flex_attention_backward_177 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4643851Z triton_flex_attention_backward_180 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4645048Z triton_flex_attention_backward_179 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4646244Z triton_flex_attention_backward_182 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4647399Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4648573Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4649731Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4650877Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4652028Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.4652270Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3449 seconds precompiling for 13 choices 2025-12-04T09:14:58.4652531Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.4652602Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.4652672Z unimplemented [] 2025-12-04T09:14:58.4652774Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.4652954Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.4654139Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.4654200Z graph_break [] 2025-12-04T09:14:58.4654334Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.4654399Z Autotune Choices Stats: 2025-12-04T09:14:58.4655826Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:58.4656106Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4656325Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4656648Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.4657775Z triton_flex_attention_194 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4658909Z triton_flex_attention_195 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4660025Z triton_flex_attention_192 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.4661194Z triton_flex_attention_193 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.4662511Z triton_flex_attention_190 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4663684Z triton_flex_attention_191 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4663939Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3094 seconds precompiling for 6 choices 2025-12-04T09:14:58.4664006Z Autotune Choices Stats: 2025-12-04T09:14:58.4665443Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.4665924Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4666253Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4666813Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.4667986Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4669150Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4670395Z triton_flex_attention_backward_197 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4671545Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4672729Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4673881Z triton_flex_attention_backward_200 0.0124 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4675271Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4676425Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4677573Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.4678786Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4679068Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.3930 seconds precompiling for 13 choices 2025-12-04T09:14:58.4679203Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.4679272Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.4679333Z unimplemented [] 2025-12-04T09:14:58.4679438Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.4679624Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.4680835Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.4680896Z graph_break [] 2025-12-04T09:14:58.4681027Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.4681090Z Autotune Choices Stats: 2025-12-04T09:14:58.4682490Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008063999935984612, "best_triton_pos": 0} 2025-12-04T09:14:58.4682774Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4682990Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4683308Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.4684430Z triton_flex_attention_213 0.0081 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4685546Z triton_flex_attention_214 0.0082 ms 98.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4686689Z triton_flex_attention_211 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.4687842Z triton_flex_attention_212 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.4688989Z triton_flex_attention_209 0.0112 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4690104Z triton_flex_attention_210 0.0133 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4690380Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.3148 seconds precompiling for 6 choices 2025-12-04T09:14:58.4690446Z Autotune Choices Stats: 2025-12-04T09:14:58.4691870Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.4692341Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4692683Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4693248Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.4694412Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4695606Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4696792Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4697975Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4699140Z triton_flex_attention_backward_219 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4700321Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4701470Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4702629Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4703805Z triton_flex_attention_backward_224 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4704993Z triton_flex_attention_backward_227 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.4705234Z SingleProcess AUTOTUNE benchmarking takes 0.6619 seconds and 2.3746 seconds precompiling for 13 choices 2025-12-04T09:14:58.4705367Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.4705436Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.4705498Z unimplemented [] 2025-12-04T09:14:58.4705607Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.4705832Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.4707019Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.4707129Z graph_break [] 2025-12-04T09:14:58.4707263Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.4707339Z Autotune Choices Stats: 2025-12-04T09:14:58.4708731Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.4708984Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4709200Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4709524Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.4710651Z triton_flex_attention_232 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4711774Z triton_flex_attention_233 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4712954Z triton_flex_attention_228 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4714072Z triton_flex_attention_230 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.4715222Z triton_flex_attention_231 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.4716338Z triton_flex_attention_229 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4716618Z SingleProcess AUTOTUNE benchmarking takes 0.2900 seconds and 1.2906 seconds precompiling for 6 choices 2025-12-04T09:14:58.4716684Z Autotune Choices Stats: 2025-12-04T09:14:58.4718109Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.4718547Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4718884Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4719449Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.4720647Z triton_flex_attention_backward_234 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4721841Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4723048Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4724202Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4725392Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4726536Z triton_flex_attention_backward_240 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4727693Z triton_flex_attention_backward_241 0.0134 ms 91.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4728844Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4730027Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4731215Z triton_flex_attention_backward_246 0.0153 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.4731461Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2768 seconds precompiling for 13 choices 2025-12-04T09:14:58.4731626Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.4731694Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.4731756Z unimplemented [] 2025-12-04T09:14:58.4731863Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.4732045Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.4733267Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.4733364Z graph_break [] 2025-12-04T09:14:58.4733491Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.4733561Z Autotune Choices Stats: 2025-12-04T09:14:58.4734944Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.4735188Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4735409Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4735741Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.4736861Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4738010Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4739160Z triton_flex_attention_250 0.0111 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.4740321Z triton_flex_attention_247 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4741438Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.4742589Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4742839Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2952 seconds precompiling for 6 choices 2025-12-04T09:14:58.4742904Z Autotune Choices Stats: 2025-12-04T09:14:58.4744329Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_255", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.4744768Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4745091Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4745653Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.4746889Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4748047Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4749235Z triton_flex_attention_backward_254 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4750414Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4751566Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4752714Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4753865Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4755047Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4756230Z triton_flex_attention_backward_262 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4757416Z triton_flex_attention_backward_265 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.4757660Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.3139 seconds precompiling for 13 choices 2025-12-04T09:14:58.4757793Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.4757860Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.4757955Z unimplemented [] 2025-12-04T09:14:58.4758062Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.4758243Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.4759427Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.4759487Z graph_break [] 2025-12-04T09:14:58.4759613Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.4759684Z Autotune Choices Stats: 2025-12-04T09:14:58.4761075Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.4761322Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4761541Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4762025Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.4763278Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4764451Z triton_flex_attention_271 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4765608Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4766731Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.4767887Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.4769016Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4769267Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T09:14:58.4769335Z Autotune Choices Stats: 2025-12-04T09:14:58.4770766Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.4771205Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4771565Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4772182Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.4773403Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4774604Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4775766Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4776955Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4778106Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4779266Z triton_flex_attention_backward_278 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4780452Z triton_flex_attention_backward_279 0.0136 ms 90.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4781639Z triton_flex_attention_backward_276 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4782820Z triton_flex_attention_backward_281 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4783973Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.4784256Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.2940 seconds precompiling for 13 choices 2025-12-04T09:14:58.4784396Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.4784469Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.4784531Z unimplemented [] 2025-12-04T09:14:58.4784643Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.4784833Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.4786013Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.4786074Z graph_break [] 2025-12-04T09:14:58.4786205Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.4786280Z Autotune Choices Stats: 2025-12-04T09:14:58.4787668Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.4787919Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4788139Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4788529Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.4789666Z triton_flex_attention_289 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4790819Z triton_flex_attention_290 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4791938Z triton_flex_attention_287 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.4793089Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4794200Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.4795318Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4795566Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3003 seconds precompiling for 6 choices 2025-12-04T09:14:58.4795630Z Autotune Choices Stats: 2025-12-04T09:14:58.4797091Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_294", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.012160000391304493, "best_triton_pos": 0} 2025-12-04T09:14:58.4797563Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4797898Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4798465Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.4799660Z triton_flex_attention_backward_294 0.0122 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4800819Z triton_flex_attention_backward_291 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4802006Z triton_flex_attention_backward_292 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4803156Z triton_flex_attention_backward_293 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4804312Z triton_flex_attention_backward_296 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4805455Z triton_flex_attention_backward_295 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4806707Z triton_flex_attention_backward_297 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4807861Z triton_flex_attention_backward_298 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4809039Z triton_flex_attention_backward_300 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4810195Z triton_flex_attention_backward_303 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.4810472Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.2523 seconds precompiling for 13 choices 2025-12-04T09:14:58.4810605Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.4810675Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.4810735Z unimplemented [] 2025-12-04T09:14:58.4810842Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.4811025Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.4812211Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.4812271Z graph_break [] 2025-12-04T09:14:58.4812436Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.4812515Z Autotune Choices Stats: 2025-12-04T09:14:58.4813911Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_308", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.4814236Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4814453Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4814772Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.4815903Z triton_flex_attention_308 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4817051Z triton_flex_attention_309 0.0082 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4818170Z triton_flex_attention_307 0.0112 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.4819325Z triton_flex_attention_304 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4820441Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.4821559Z triton_flex_attention_305 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4821799Z SingleProcess AUTOTUNE benchmarking takes 0.2924 seconds and 1.4096 seconds precompiling for 6 choices 2025-12-04T09:14:58.4821873Z Autotune Choices Stats: 2025-12-04T09:14:58.4823327Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.4823795Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4824119Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4824716Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.4825895Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4827091Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4828251Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4829404Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4830561Z triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4831744Z triton_flex_attention_backward_316 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4832928Z triton_flex_attention_backward_317 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4834113Z triton_flex_attention_backward_314 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4835256Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4836460Z triton_flex_attention_backward_318 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4836702Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.3225 seconds precompiling for 13 choices 2025-12-04T09:14:58.4836837Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.4836906Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.4836968Z unimplemented [] 2025-12-04T09:14:58.4837078Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.4837265Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.4838444Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.4838503Z graph_break [] 2025-12-04T09:14:58.4838631Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.4838704Z Autotune Choices Stats: 2025-12-04T09:14:58.4840141Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.007968000136315823, "best_triton_pos": 0} 2025-12-04T09:14:58.4840422Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4840638Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4840964Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.4842124Z triton_flex_attention_327 0.0080 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4843251Z triton_flex_attention_328 0.0082 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4844399Z triton_flex_attention_325 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.4845521Z triton_flex_attention_326 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.4846642Z triton_flex_attention_323 0.0113 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4847760Z triton_flex_attention_324 0.0133 ms 59.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4848072Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T09:14:58.4848144Z Autotune Choices Stats: 2025-12-04T09:14:58.4849581Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.4850019Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4850378Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4850943Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.4852106Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4853403Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4854569Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4855720Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4856919Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4858099Z triton_flex_attention_backward_336 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4859286Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4860435Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4861636Z triton_flex_attention_backward_338 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4862950Z triton_flex_attention_backward_341 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.4863211Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3668 seconds precompiling for 13 choices 2025-12-04T09:14:58.4863353Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.4863423Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.4863484Z unimplemented [] 2025-12-04T09:14:58.4863590Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.4863775Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.4864955Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.4865089Z graph_break [] 2025-12-04T09:14:58.4865262Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.4865335Z Autotune Choices Stats: 2025-12-04T09:14:58.4866726Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.4866973Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4867190Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4867580Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.4868730Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4869891Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4871000Z triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4872122Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.4873235Z triton_flex_attention_345 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.4874388Z triton_flex_attention_343 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4874659Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2820 seconds precompiling for 6 choices 2025-12-04T09:14:58.4874731Z Autotune Choices Stats: 2025-12-04T09:14:58.4876193Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_349", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.4876633Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4876956Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4877552Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.4878716Z triton_flex_attention_backward_349 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4879877Z triton_flex_attention_backward_348 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4881028Z triton_flex_attention_backward_350 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4882211Z triton_flex_attention_backward_351 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4883406Z triton_flex_attention_backward_353 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4884552Z triton_flex_attention_backward_355 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4885747Z triton_flex_attention_backward_352 0.0143 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4886927Z triton_flex_attention_backward_354 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4888080Z triton_flex_attention_backward_357 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4889238Z triton_flex_attention_backward_360 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.4889482Z SingleProcess AUTOTUNE benchmarking takes 0.6637 seconds and 2.2959 seconds precompiling for 13 choices 2025-12-04T09:14:58.4889615Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.4889684Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.4889746Z unimplemented [] 2025-12-04T09:14:58.4889853Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.4890038Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.4891245Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.4891341Z graph_break [] 2025-12-04T09:14:58.4891468Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.4891539Z Autotune Choices Stats: 2025-12-04T09:14:58.4893018Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.4893270Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4893488Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4893807Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.4895012Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4896136Z triton_flex_attention_366 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4897255Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.4898369Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4899516Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.4900687Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4900934Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3270 seconds precompiling for 6 choices 2025-12-04T09:14:58.4901018Z Autotune Choices Stats: 2025-12-04T09:14:58.4902479Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_368", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.4902919Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4903277Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4903840Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.4905002Z triton_flex_attention_backward_368 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4906167Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4907315Z triton_flex_attention_backward_369 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4908504Z triton_flex_attention_backward_370 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4909684Z triton_flex_attention_backward_372 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4910864Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4912017Z triton_flex_attention_backward_374 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4913200Z triton_flex_attention_backward_371 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4914351Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4915505Z triton_flex_attention_backward_379 0.0146 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.4915750Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.4637 seconds precompiling for 13 choices 2025-12-04T09:14:58.4915879Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.4915950Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.4916045Z unimplemented [] 2025-12-04T09:14:58.4916181Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.4916368Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.4917534Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.4917602Z graph_break [] 2025-12-04T09:14:58.4917727Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.4917794Z Autotune Choices Stats: 2025-12-04T09:14:58.4919210Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_384", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.4919461Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4919708Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4920023Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.4921158Z triton_flex_attention_384 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4922278Z triton_flex_attention_385 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4923407Z triton_flex_attention_380 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4924562Z triton_flex_attention_382 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.4925709Z triton_flex_attention_383 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.4926827Z triton_flex_attention_381 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4927104Z SingleProcess AUTOTUNE benchmarking takes 0.2918 seconds and 1.3242 seconds precompiling for 6 choices 2025-12-04T09:14:58.4927175Z Autotune Choices Stats: 2025-12-04T09:14:58.4928590Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_386", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.4929063Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4929382Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4929943Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.4931110Z triton_flex_attention_backward_386 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4932274Z triton_flex_attention_backward_387 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4933521Z triton_flex_attention_backward_388 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4934712Z triton_flex_attention_backward_389 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4935914Z triton_flex_attention_backward_391 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4937066Z triton_flex_attention_backward_392 0.0134 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4938255Z triton_flex_attention_backward_390 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4939399Z triton_flex_attention_backward_393 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4940556Z triton_flex_attention_backward_395 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4941706Z triton_flex_attention_backward_394 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4942018Z SingleProcess AUTOTUNE benchmarking takes 0.6645 seconds and 2.2524 seconds precompiling for 13 choices 2025-12-04T09:14:58.4942147Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.4942222Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.4942285Z unimplemented [] 2025-12-04T09:14:58.4942391Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.4942576Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.4943746Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.4943815Z graph_break [] 2025-12-04T09:14:58.4943974Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.4944045Z Autotune Choices Stats: 2025-12-04T09:14:58.4945432Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_403", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.4945713Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4945932Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4946245Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.4947380Z triton_flex_attention_403 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4948503Z triton_flex_attention_404 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4949612Z triton_flex_attention_399 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4950800Z triton_flex_attention_401 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.4951914Z triton_flex_attention_402 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.4953088Z triton_flex_attention_400 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4953331Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3374 seconds precompiling for 6 choices 2025-12-04T09:14:58.4953403Z Autotune Choices Stats: 2025-12-04T09:14:58.4954864Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_406", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.4955302Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4955624Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4956195Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.4957360Z triton_flex_attention_backward_406 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4958546Z triton_flex_attention_backward_405 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4959723Z triton_flex_attention_backward_407 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4960917Z triton_flex_attention_backward_408 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4962265Z triton_flex_attention_backward_410 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4963552Z triton_flex_attention_backward_409 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4964718Z triton_flex_attention_backward_411 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4965874Z triton_flex_attention_backward_412 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4967046Z triton_flex_attention_backward_413 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4968257Z triton_flex_attention_backward_414 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4968551Z SingleProcess AUTOTUNE benchmarking takes 0.6644 seconds and 2.4031 seconds precompiling for 13 choices 2025-12-04T09:14:58.4968698Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.4968769Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.4968830Z unimplemented [] 2025-12-04T09:14:58.4968942Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.4969128Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.4970360Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.4970430Z graph_break [] 2025-12-04T09:14:58.4970560Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.4970634Z Autotune Choices Stats: 2025-12-04T09:14:58.4972058Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_422", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.4972349Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4972585Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4972914Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.4974050Z triton_flex_attention_422 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4975170Z triton_flex_attention_423 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4976322Z triton_flex_attention_420 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.4977480Z triton_flex_attention_421 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.4978623Z triton_flex_attention_418 0.0123 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4979749Z triton_flex_attention_419 0.0142 ms 57.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4980023Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3073 seconds precompiling for 6 choices 2025-12-04T09:14:58.4980100Z Autotune Choices Stats: 2025-12-04T09:14:58.4981535Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_426", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012256000190973282, "best_triton_pos": 0} 2025-12-04T09:14:58.4981982Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4982314Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4982878Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.4984039Z triton_flex_attention_backward_426 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4985268Z triton_flex_attention_backward_424 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4986417Z triton_flex_attention_backward_425 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4987604Z triton_flex_attention_backward_427 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4988752Z triton_flex_attention_backward_429 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4989937Z triton_flex_attention_backward_431 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.4991096Z triton_flex_attention_backward_430 0.0141 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.4992244Z triton_flex_attention_backward_428 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.4993432Z triton_flex_attention_backward_433 0.0143 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.4994610Z triton_flex_attention_backward_436 0.0144 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.4994853Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.3392 seconds precompiling for 13 choices 2025-12-04T09:14:58.4994983Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.4995059Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.4995121Z unimplemented [] 2025-12-04T09:14:58.4995287Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.4995480Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.4996654Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.4996761Z graph_break [] 2025-12-04T09:14:58.4996885Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.4996959Z Autotune Choices Stats: 2025-12-04T09:14:58.4998345Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_441", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.4998592Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.4998810Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.4999137Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5000271Z triton_flex_attention_441 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5001419Z triton_flex_attention_442 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5002560Z triton_flex_attention_439 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.5003711Z triton_flex_attention_437 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5004823Z triton_flex_attention_440 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.5005972Z triton_flex_attention_438 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5006220Z SingleProcess AUTOTUNE benchmarking takes 0.2906 seconds and 1.4377 seconds precompiling for 6 choices 2025-12-04T09:14:58.5006301Z Autotune Choices Stats: 2025-12-04T09:14:58.5007729Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_445", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.5008175Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5008500Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5009072Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5010271Z triton_flex_attention_backward_445 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5011457Z triton_flex_attention_backward_446 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5012686Z triton_flex_attention_backward_443 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5013845Z triton_flex_attention_backward_444 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5015034Z triton_flex_attention_backward_447 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5016179Z triton_flex_attention_backward_448 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5017338Z triton_flex_attention_backward_450 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5018518Z triton_flex_attention_backward_449 0.0135 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5019710Z triton_flex_attention_backward_452 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5020899Z triton_flex_attention_backward_455 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.5021147Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.3278 seconds precompiling for 13 choices 2025-12-04T09:14:58.5021275Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.5021350Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.5021412Z unimplemented [] 2025-12-04T09:14:58.5021519Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.5021717Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.5022932Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.5023000Z graph_break [] 2025-12-04T09:14:58.5023126Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.5023200Z Autotune Choices Stats: 2025-12-04T09:14:58.5024587Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_460", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:58.5024836Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5025055Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5025367Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5026530Z triton_flex_attention_460 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5027699Z triton_flex_attention_461 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5028825Z triton_flex_attention_459 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.5030013Z triton_flex_attention_456 0.0114 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5031124Z triton_flex_attention_458 0.0123 ms 75.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.5032277Z triton_flex_attention_457 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5032519Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.4513 seconds precompiling for 6 choices 2025-12-04T09:14:58.5032591Z Autotune Choices Stats: 2025-12-04T09:14:58.5034019Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_463", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.5034470Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5034799Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5035430Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5036597Z triton_flex_attention_backward_463 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5037788Z triton_flex_attention_backward_465 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5038947Z triton_flex_attention_backward_462 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5040136Z triton_flex_attention_backward_464 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5041285Z triton_flex_attention_backward_467 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5042431Z triton_flex_attention_backward_468 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5043586Z triton_flex_attention_backward_469 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5044795Z triton_flex_attention_backward_466 0.0133 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5045947Z triton_flex_attention_backward_471 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5047141Z triton_flex_attention_backward_470 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5047389Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.2662 seconds precompiling for 13 choices 2025-12-04T09:14:58.5047548Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.5047622Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.5047683Z unimplemented [] 2025-12-04T09:14:58.5047788Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.5047977Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.5049150Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.5049215Z graph_break [] 2025-12-04T09:14:58.5049341Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.5049411Z Autotune Choices Stats: 2025-12-04T09:14:58.5050793Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_479", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.007968000136315823, "best_triton_pos": 0} 2025-12-04T09:14:58.5051042Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5051258Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5051573Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5052780Z triton_flex_attention_479 0.0080 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5053920Z triton_flex_attention_480 0.0082 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5055071Z triton_flex_attention_477 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.5056185Z triton_flex_attention_478 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.5057326Z triton_flex_attention_475 0.0113 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5058443Z triton_flex_attention_476 0.0133 ms 59.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5058686Z SingleProcess AUTOTUNE benchmarking takes 0.2917 seconds and 1.3266 seconds precompiling for 6 choices 2025-12-04T09:14:58.5058760Z Autotune Choices Stats: 2025-12-04T09:14:58.5060196Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_481", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.5060695Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5061052Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5061613Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5062935Z triton_flex_attention_backward_481 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5064153Z triton_flex_attention_backward_482 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5065371Z triton_flex_attention_backward_483 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5066531Z triton_flex_attention_backward_484 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5067689Z triton_flex_attention_backward_485 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5068840Z triton_flex_attention_backward_486 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5070037Z triton_flex_attention_backward_487 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5071223Z triton_flex_attention_backward_488 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5072411Z triton_flex_attention_backward_490 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5073560Z triton_flex_attention_backward_493 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.5073840Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2475 seconds precompiling for 13 choices 2025-12-04T09:14:58.5073970Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.5074046Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.5074107Z unimplemented [] 2025-12-04T09:14:58.5074206Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.5074394Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.5075563Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.5075631Z graph_break [] 2025-12-04T09:14:58.5075757Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.5075826Z Autotune Choices Stats: 2025-12-04T09:14:58.5077228Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_498", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.5077479Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5077764Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5078079Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5079223Z triton_flex_attention_498 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5080364Z triton_flex_attention_499 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5081482Z triton_flex_attention_494 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5082636Z triton_flex_attention_496 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.5083745Z triton_flex_attention_497 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.5084870Z triton_flex_attention_495 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5085113Z SingleProcess AUTOTUNE benchmarking takes 0.2916 seconds and 1.3152 seconds precompiling for 6 choices 2025-12-04T09:14:58.5085185Z Autotune Choices Stats: 2025-12-04T09:14:58.5086645Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_500", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.5087116Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5087439Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5088004Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5089209Z triton_flex_attention_backward_500 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5090365Z triton_flex_attention_backward_501 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5091569Z triton_flex_attention_backward_502 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5092792Z triton_flex_attention_backward_503 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5093951Z triton_flex_attention_backward_505 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5095136Z triton_flex_attention_backward_507 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5096319Z triton_flex_attention_backward_506 0.0143 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5097499Z triton_flex_attention_backward_504 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5098655Z triton_flex_attention_backward_509 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5099841Z triton_flex_attention_backward_512 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.5100095Z SingleProcess AUTOTUNE benchmarking takes 0.6641 seconds and 2.2733 seconds precompiling for 13 choices 2025-12-04T09:14:58.5100225Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.5100299Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.5100361Z unimplemented [] 2025-12-04T09:14:58.5100465Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.5100653Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.5101829Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 27), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 8), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 5), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.5101896Z graph_break [] 2025-12-04T09:14:58.5102020Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.5102089Z Autotune Choices Stats: 2025-12-04T09:14:58.5103512Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_517", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.5103796Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5104011Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5104324Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5105485Z triton_flex_attention_517 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5106597Z triton_flex_attention_518 0.0084 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5107743Z triton_flex_attention_513 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5108867Z triton_flex_attention_515 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.5109981Z triton_flex_attention_516 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.5111095Z triton_flex_attention_514 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5111335Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3311 seconds precompiling for 6 choices 2025-12-04T09:14:58.5111404Z Autotune Choices Stats: 2025-12-04T09:14:58.5112897Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_519", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.5113333Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5113670Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5114271Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5115436Z triton_flex_attention_backward_519 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5116764Z triton_flex_attention_backward_520 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5117918Z triton_flex_attention_backward_521 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5119076Z triton_flex_attention_backward_522 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5120229Z triton_flex_attention_backward_524 0.0132 ms 93.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5121446Z triton_flex_attention_backward_523 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5122601Z triton_flex_attention_backward_525 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5123799Z triton_flex_attention_backward_526 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5124952Z triton_flex_attention_backward_528 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5126126Z triton_flex_attention_backward_531 0.0153 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.5126371Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.2909 seconds precompiling for 13 choices 2025-12-04T09:14:58.5126497Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.5126572Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.5126636Z unimplemented [] 2025-12-04T09:14:58.5126736Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.5126925Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.5128100Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.5128165Z graph_break [] 2025-12-04T09:14:58.5128292Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.5128356Z Autotune Choices Stats: 2025-12-04T09:14:58.5129821Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_536", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.5130066Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5130280Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5130596Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5131764Z triton_flex_attention_536 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5132920Z triton_flex_attention_537 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5134085Z triton_flex_attention_532 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5135204Z triton_flex_attention_534 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.5136322Z triton_flex_attention_535 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.5137470Z triton_flex_attention_533 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5137742Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.2928 seconds precompiling for 6 choices 2025-12-04T09:14:58.5137817Z Autotune Choices Stats: 2025-12-04T09:14:58.5139263Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_538", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:14:58.5139737Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5140064Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5140628Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5141829Z triton_flex_attention_backward_538 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5142986Z triton_flex_attention_backward_539 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5144135Z triton_flex_attention_backward_540 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5145293Z triton_flex_attention_backward_541 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5146471Z triton_flex_attention_backward_542 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5147654Z triton_flex_attention_backward_543 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5148839Z triton_flex_attention_backward_545 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5149995Z triton_flex_attention_backward_544 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5151184Z triton_flex_attention_backward_547 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5152328Z triton_flex_attention_backward_550 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.5152581Z SingleProcess AUTOTUNE benchmarking takes 0.6681 seconds and 2.3456 seconds precompiling for 13 choices 2025-12-04T09:14:58.5152754Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:14:58.5152838Z Traceback (most recent call last): 2025-12-04T09:14:58.5153136Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:14:58.5153199Z self.assertTrue( 2025-12-04T09:14:58.5153399Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:14:58.5153478Z raise self.failureException(msg) 2025-12-04T09:14:58.5153723Z AssertionError: False is not true : Log file /tmp/tmp7i52oz8u/flex_attention_configs.json was not created 2025-12-04T09:14:58.5153727Z 2025-12-04T09:14:58.5153861Z To execute this test, run the following from the base repo dir: 2025-12-04T09:14:58.5154200Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:14:58.5154206Z 2025-12-04T09:14:58.5154375Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:14:58.5154503Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.5154577Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.5154638Z unimplemented [] 2025-12-04T09:14:58.5154740Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.5155927Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('async_compile_cache_miss', 23), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:14:58.5156152Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.5156223Z graph_break [] 2025-12-04T09:14:58.5156350Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.5157337Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:14:58.5157454Z current_size = base.storage().size() 2025-12-04T09:14:58.5157518Z Autotune Choices Stats: 2025-12-04T09:14:58.5158922Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:58.5159169Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5159391Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5159704Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5160843Z triton_flex_attention_4 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5162135Z triton_flex_attention_5 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5163427Z triton_flex_attention_0 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5164548Z triton_flex_attention_2 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.5165712Z triton_flex_attention_3 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.5166829Z triton_flex_attention_1 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5167130Z SingleProcess AUTOTUNE benchmarking takes 0.2600 seconds and 1.4244 seconds precompiling for 6 choices 2025-12-04T09:14:58.5167198Z Autotune Choices Stats: 2025-12-04T09:14:58.5168635Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:58.5169071Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5169404Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5169963Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5171158Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5172378Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5173563Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5174731Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5175910Z triton_flex_attention_backward_11 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5177062Z triton_flex_attention_backward_12 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5178210Z triton_flex_attention_backward_13 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5179362Z triton_flex_attention_backward_10 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5180547Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5181720Z triton_flex_attention_backward_15 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5181971Z SingleProcess AUTOTUNE benchmarking takes 0.8717 seconds and 2.2315 seconds precompiling for 13 choices 2025-12-04T09:14:58.5182133Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.5182206Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.5182274Z unimplemented [] 2025-12-04T09:14:58.5182375Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.5182560Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.5183750Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.5183845Z graph_break [] 2025-12-04T09:14:58.5183980Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.5184045Z Autotune Choices Stats: 2025-12-04T09:14:58.5185433Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.5185673Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5185900Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5186214Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5187338Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5188501Z triton_flex_attention_24 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5189653Z triton_flex_attention_19 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5190794Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.5191914Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.5193064Z triton_flex_attention_20 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5193316Z SingleProcess AUTOTUNE benchmarking takes 0.2898 seconds and 1.2778 seconds precompiling for 6 choices 2025-12-04T09:14:58.5193381Z Autotune Choices Stats: 2025-12-04T09:14:58.5194811Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.5195245Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5195578Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5196141Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5197374Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5198523Z triton_flex_attention_backward_27 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5199717Z triton_flex_attention_backward_28 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5200867Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5202042Z triton_flex_attention_backward_30 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5203190Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5204332Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5209959Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5211227Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5212517Z triton_flex_attention_backward_37 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.5212804Z SingleProcess AUTOTUNE benchmarking takes 0.6605 seconds and 2.3199 seconds precompiling for 13 choices 2025-12-04T09:14:58.5212947Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.5213022Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.5213094Z unimplemented [] 2025-12-04T09:14:58.5213240Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.5213433Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.5214626Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.5214691Z graph_break [] 2025-12-04T09:14:58.5214831Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.5214902Z Autotune Choices Stats: 2025-12-04T09:14:58.5216312Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.5216567Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5216793Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5217110Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5218296Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5219470Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5220631Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.5221757Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.5222932Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5224050Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5224308Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2616 seconds precompiling for 6 choices 2025-12-04T09:14:58.5224378Z Autotune Choices Stats: 2025-12-04T09:14:58.5225819Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.5226255Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5226637Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5227235Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5228417Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5229609Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5230764Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5231943Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5233093Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5234248Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5235424Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5236610Z triton_flex_attention_backward_50 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5237757Z triton_flex_attention_backward_53 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5238937Z triton_flex_attention_backward_56 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.5239222Z SingleProcess AUTOTUNE benchmarking takes 0.6631 seconds and 2.2635 seconds precompiling for 13 choices 2025-12-04T09:14:58.5239363Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.5239434Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.5239512Z unimplemented [] 2025-12-04T09:14:58.5239628Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.5239819Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.5241003Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.5241067Z graph_break [] 2025-12-04T09:14:58.5241203Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.5241268Z Autotune Choices Stats: 2025-12-04T09:14:58.5242661Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.5242909Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5243134Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5243457Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5244656Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5245774Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5246930Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.5248046Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.5249192Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5250302Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5250558Z SingleProcess AUTOTUNE benchmarking takes 0.2907 seconds and 1.2500 seconds precompiling for 6 choices 2025-12-04T09:14:58.5250628Z Autotune Choices Stats: 2025-12-04T09:14:58.5252051Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.5252600Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5252936Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5253496Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5254724Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5255884Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5257075Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5258219Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5259370Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5260524Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5261709Z triton_flex_attention_backward_69 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5263126Z triton_flex_attention_backward_70 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5264365Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5265532Z triton_flex_attention_backward_75 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.5265834Z SingleProcess AUTOTUNE benchmarking takes 0.6601 seconds and 2.1469 seconds precompiling for 13 choices 2025-12-04T09:14:58.5265976Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.5266048Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.5266116Z unimplemented [] 2025-12-04T09:14:58.5266221Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.5266408Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.5267591Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.5267655Z graph_break [] 2025-12-04T09:14:58.5267792Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.5267859Z Autotune Choices Stats: 2025-12-04T09:14:58.5269260Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.5269558Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5269828Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5270151Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5271288Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5272442Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5273575Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.5274735Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.5275846Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5276959Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5277215Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2576 seconds precompiling for 6 choices 2025-12-04T09:14:58.5277285Z Autotune Choices Stats: 2025-12-04T09:14:58.5278757Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_84", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.5279226Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5279560Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5280133Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5281342Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5282496Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5283684Z triton_flex_attention_backward_83 0.0122 ms 92.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5284833Z triton_flex_attention_backward_82 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5285988Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5287181Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5288378Z triton_flex_attention_backward_88 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5289560Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5290711Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5291895Z triton_flex_attention_backward_91 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5292142Z SingleProcess AUTOTUNE benchmarking takes 0.6599 seconds and 2.2799 seconds precompiling for 13 choices 2025-12-04T09:14:58.5292276Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.5292384Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.5292452Z unimplemented [] 2025-12-04T09:14:58.5292555Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.5292741Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.5293928Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.5293987Z graph_break [] 2025-12-04T09:14:58.5294119Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.5294185Z Autotune Choices Stats: 2025-12-04T09:14:58.5295608Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:58.5295901Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5296123Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5296446Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5297617Z triton_flex_attention_99 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5298737Z triton_flex_attention_100 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5299887Z triton_flex_attention_95 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5301006Z triton_flex_attention_98 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.5302135Z triton_flex_attention_97 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.5303246Z triton_flex_attention_96 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5303498Z SingleProcess AUTOTUNE benchmarking takes 0.2905 seconds and 1.2994 seconds precompiling for 6 choices 2025-12-04T09:14:58.5303642Z Autotune Choices Stats: 2025-12-04T09:14:58.5305119Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_102", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.5305558Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5305924Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5306485Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5307655Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5308853Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5310011Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5311169Z triton_flex_attention_backward_101 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5312358Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5313547Z triton_flex_attention_backward_107 0.0133 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5314724Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5315882Z triton_flex_attention_backward_108 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5317063Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5318220Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5318469Z SingleProcess AUTOTUNE benchmarking takes 0.6615 seconds and 2.2434 seconds precompiling for 13 choices 2025-12-04T09:14:58.5318606Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.5318678Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.5318746Z unimplemented [] 2025-12-04T09:14:58.5318851Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.5319036Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.5320221Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.5320283Z graph_break [] 2025-12-04T09:14:58.5320416Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.5320540Z Autotune Choices Stats: 2025-12-04T09:14:58.5321980Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.5322234Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5322453Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5322779Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5323940Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5325093Z triton_flex_attention_119 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5326209Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.5327334Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.5328446Z triton_flex_attention_114 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5329592Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5329874Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2587 seconds precompiling for 6 choices 2025-12-04T09:14:58.5329941Z Autotune Choices Stats: 2025-12-04T09:14:58.5331369Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.5331835Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5332169Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5332795Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5334004Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5335156Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5336319Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5337468Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5338690Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5339852Z triton_flex_attention_backward_124 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5341035Z triton_flex_attention_backward_126 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5342190Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5343375Z triton_flex_attention_backward_129 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5344532Z triton_flex_attention_backward_132 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.5344780Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.3330 seconds precompiling for 13 choices 2025-12-04T09:14:58.5344916Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.5344988Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.5345052Z unimplemented [] 2025-12-04T09:14:58.5345161Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.5345347Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.5346567Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.5346659Z graph_break [] 2025-12-04T09:14:58.5346795Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.5346861Z Autotune Choices Stats: 2025-12-04T09:14:58.5348244Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:58.5348538Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5348760Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5349085Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5350216Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5351388Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5352500Z triton_flex_attention_136 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.5353617Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.5354758Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5355913Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5356161Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3488 seconds precompiling for 6 choices 2025-12-04T09:14:58.5356226Z Autotune Choices Stats: 2025-12-04T09:14:58.5357694Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_140", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.5358129Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5358491Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5359057Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5360228Z triton_flex_attention_backward_140 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5361385Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5362719Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5363948Z triton_flex_attention_backward_142 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5365171Z triton_flex_attention_backward_144 0.0131 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5366367Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5367520Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5368718Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5369874Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5371031Z triton_flex_attention_backward_149 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5371276Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.2392 seconds precompiling for 13 choices 2025-12-04T09:14:58.5371417Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.5371486Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.5371550Z unimplemented [] 2025-12-04T09:14:58.5371659Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.5371912Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.5373144Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.5373209Z graph_break [] 2025-12-04T09:14:58.5373348Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.5373415Z Autotune Choices Stats: 2025-12-04T09:14:58.5374842Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.5375094Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5375313Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5375666Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5376802Z triton_flex_attention_156 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5377922Z triton_flex_attention_157 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5379037Z triton_flex_attention_152 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5380155Z triton_flex_attention_154 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.5381335Z triton_flex_attention_155 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.5382445Z triton_flex_attention_153 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5382696Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3309 seconds precompiling for 6 choices 2025-12-04T09:14:58.5382812Z Autotune Choices Stats: 2025-12-04T09:14:58.5384241Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_160", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.5384724Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5385062Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5385630Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5386791Z triton_flex_attention_backward_160 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5387966Z triton_flex_attention_backward_158 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5389157Z triton_flex_attention_backward_159 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5390357Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5391559Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5392701Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5393890Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5395048Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5396197Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5397352Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.5397597Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.1546 seconds precompiling for 13 choices 2025-12-04T09:14:58.5397800Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.5397874Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.5397940Z unimplemented [] 2025-12-04T09:14:58.5398061Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.5398248Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.5399430Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.5399494Z graph_break [] 2025-12-04T09:14:58.5399620Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.5399727Z Autotune Choices Stats: 2025-12-04T09:14:58.5401111Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.5401395Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5401616Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5401941Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5403065Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5404181Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5405288Z triton_flex_attention_171 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5406445Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.5407592Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.5408735Z triton_flex_attention_172 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5408984Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3199 seconds precompiling for 6 choices 2025-12-04T09:14:58.5409053Z Autotune Choices Stats: 2025-12-04T09:14:58.5410482Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_178", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.5410969Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5411292Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5411852Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5413052Z triton_flex_attention_backward_178 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5414203Z triton_flex_attention_backward_177 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5415439Z triton_flex_attention_backward_180 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5416586Z triton_flex_attention_backward_179 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5417770Z triton_flex_attention_backward_182 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5418921Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5420106Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5421267Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5422420Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5423602Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.5423874Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3449 seconds precompiling for 13 choices 2025-12-04T09:14:58.5424007Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.5424074Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.5424135Z unimplemented [] 2025-12-04T09:14:58.5424242Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.5424429Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.5425646Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.5425709Z graph_break [] 2025-12-04T09:14:58.5425836Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.5425908Z Autotune Choices Stats: 2025-12-04T09:14:58.5427295Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:58.5427575Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5427791Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5428109Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5429246Z triton_flex_attention_194 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5430364Z triton_flex_attention_195 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5431514Z triton_flex_attention_192 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.5432663Z triton_flex_attention_193 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.5433803Z triton_flex_attention_190 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5434914Z triton_flex_attention_191 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5435230Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3094 seconds precompiling for 6 choices 2025-12-04T09:14:58.5435295Z Autotune Choices Stats: 2025-12-04T09:14:58.5436719Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.5437160Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5437489Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5438056Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5439226Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5440416Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5441601Z triton_flex_attention_backward_197 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5442781Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5443929Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5445116Z triton_flex_attention_backward_200 0.0124 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5446277Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5447433Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5448627Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.5449812Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5450055Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.3930 seconds precompiling for 13 choices 2025-12-04T09:14:58.5450188Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.5450257Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.5450321Z unimplemented [] 2025-12-04T09:14:58.5450433Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.5450647Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.5451836Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.5451929Z graph_break [] 2025-12-04T09:14:58.5452055Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.5452128Z Autotune Choices Stats: 2025-12-04T09:14:58.5453553Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008063999935984612, "best_triton_pos": 0} 2025-12-04T09:14:58.5453806Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5454029Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5454348Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5455479Z triton_flex_attention_213 0.0081 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5456629Z triton_flex_attention_214 0.0082 ms 98.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5457773Z triton_flex_attention_211 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.5458889Z triton_flex_attention_212 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.5460031Z triton_flex_attention_209 0.0112 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5461148Z triton_flex_attention_210 0.0133 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5461432Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.3148 seconds precompiling for 6 choices 2025-12-04T09:14:58.5461498Z Autotune Choices Stats: 2025-12-04T09:14:58.5463076Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.5463527Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5463851Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5464430Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5465659Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5466875Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5468078Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5469236Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5470436Z triton_flex_attention_backward_219 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5471587Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5472741Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5473891Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5475124Z triton_flex_attention_backward_224 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5476279Z triton_flex_attention_backward_227 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.5476554Z SingleProcess AUTOTUNE benchmarking takes 0.6619 seconds and 2.3746 seconds precompiling for 13 choices 2025-12-04T09:14:58.5476692Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.5476760Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.5476821Z unimplemented [] 2025-12-04T09:14:58.5476926Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.5477107Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.5478292Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.5478409Z graph_break [] 2025-12-04T09:14:58.5478548Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.5478620Z Autotune Choices Stats: 2025-12-04T09:14:58.5480019Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.5480270Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5480488Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5480804Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5481943Z triton_flex_attention_232 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5483096Z triton_flex_attention_233 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5484234Z triton_flex_attention_228 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5485386Z triton_flex_attention_230 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.5486496Z triton_flex_attention_231 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.5487646Z triton_flex_attention_229 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5487891Z SingleProcess AUTOTUNE benchmarking takes 0.2900 seconds and 1.2906 seconds precompiling for 6 choices 2025-12-04T09:14:58.5487955Z Autotune Choices Stats: 2025-12-04T09:14:58.5489378Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.5489829Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5490157Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5490758Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5491946Z triton_flex_attention_backward_234 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5493211Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5494373Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5495556Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5496709Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5497858Z triton_flex_attention_backward_240 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5499022Z triton_flex_attention_backward_241 0.0134 ms 91.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5500224Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5501409Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5502609Z triton_flex_attention_backward_246 0.0153 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.5502853Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2768 seconds precompiling for 13 choices 2025-12-04T09:14:58.5502988Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.5503093Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.5503154Z unimplemented [] 2025-12-04T09:14:58.5503261Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.5503445Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.5504636Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.5504700Z graph_break [] 2025-12-04T09:14:58.5504829Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.5504904Z Autotune Choices Stats: 2025-12-04T09:14:58.5506291Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.5506540Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5506757Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5507078Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5508240Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5509402Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5510565Z triton_flex_attention_250 0.0111 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.5511682Z triton_flex_attention_247 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5512826Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.5513946Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5514192Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2952 seconds precompiling for 6 choices 2025-12-04T09:14:58.5514266Z Autotune Choices Stats: 2025-12-04T09:14:58.5515705Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_255", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.5516143Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5516504Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5517099Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5518261Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5519456Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5520611Z triton_flex_attention_backward_254 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5521795Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5522946Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5524098Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5525278Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5526452Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5527648Z triton_flex_attention_backward_262 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5528808Z triton_flex_attention_backward_265 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.5529083Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.3139 seconds precompiling for 13 choices 2025-12-04T09:14:58.5529219Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.5529288Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.5529351Z unimplemented [] 2025-12-04T09:14:58.5529457Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.5529642Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.5530822Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.5530885Z graph_break [] 2025-12-04T09:14:58.5531010Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.5531097Z Autotune Choices Stats: 2025-12-04T09:14:58.5532524Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.5532775Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5532992Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5533382Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5534509Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5535663Z triton_flex_attention_271 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5536775Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5537923Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.5539038Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.5540156Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5540398Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T09:14:58.5540469Z Autotune Choices Stats: 2025-12-04T09:14:58.5541925Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.5542414Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5542737Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5543314Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5544507Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5545663Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5546852Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5548001Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5549156Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5550334Z triton_flex_attention_backward_278 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5551518Z triton_flex_attention_backward_279 0.0136 ms 90.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5552661Z triton_flex_attention_backward_276 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5553845Z triton_flex_attention_backward_281 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5555029Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.5555272Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.2940 seconds precompiling for 13 choices 2025-12-04T09:14:58.5555402Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.5555481Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.5555545Z unimplemented [] 2025-12-04T09:14:58.5555653Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.5555835Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.5557010Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.5557077Z graph_break [] 2025-12-04T09:14:58.5557203Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.5557273Z Autotune Choices Stats: 2025-12-04T09:14:58.5558692Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.5558976Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5559194Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5559513Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5560636Z triton_flex_attention_289 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5561924Z triton_flex_attention_290 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5563054Z triton_flex_attention_287 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.5564413Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5565524Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.5566646Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5566902Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3003 seconds precompiling for 6 choices 2025-12-04T09:14:58.5566974Z Autotune Choices Stats: 2025-12-04T09:14:58.5568453Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_294", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.012160000391304493, "best_triton_pos": 0} 2025-12-04T09:14:58.5568936Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5569261Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5569911Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5571094Z triton_flex_attention_backward_294 0.0122 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5572326Z triton_flex_attention_backward_291 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5573476Z triton_flex_attention_backward_292 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5574630Z triton_flex_attention_backward_293 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5575784Z triton_flex_attention_backward_296 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5576985Z triton_flex_attention_backward_295 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5578170Z triton_flex_attention_backward_297 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5579349Z triton_flex_attention_backward_298 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5580503Z triton_flex_attention_backward_300 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5581688Z triton_flex_attention_backward_303 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.5581931Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.2523 seconds precompiling for 13 choices 2025-12-04T09:14:58.5582068Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.5582137Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.5582198Z unimplemented [] 2025-12-04T09:14:58.5582309Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.5582495Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.5583673Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.5583736Z graph_break [] 2025-12-04T09:14:58.5583863Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.5583935Z Autotune Choices Stats: 2025-12-04T09:14:58.5585372Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_308", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.5585652Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5585870Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5586198Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5587371Z triton_flex_attention_308 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5588488Z triton_flex_attention_309 0.0082 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5589633Z triton_flex_attention_307 0.0112 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.5590746Z triton_flex_attention_304 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5591867Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.5592983Z triton_flex_attention_305 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5593291Z SingleProcess AUTOTUNE benchmarking takes 0.2924 seconds and 1.4096 seconds precompiling for 6 choices 2025-12-04T09:14:58.5593367Z Autotune Choices Stats: 2025-12-04T09:14:58.5594794Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.5595235Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5595594Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5596157Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5597324Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5598521Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5599689Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5600844Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5602037Z triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5603224Z triton_flex_attention_backward_316 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5604407Z triton_flex_attention_backward_317 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5605556Z triton_flex_attention_backward_314 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5606768Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5607922Z triton_flex_attention_backward_318 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5608173Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.3225 seconds precompiling for 13 choices 2025-12-04T09:14:58.5608302Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.5608377Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.5608438Z unimplemented [] 2025-12-04T09:14:58.5608543Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.5608729Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.5609915Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.5610051Z graph_break [] 2025-12-04T09:14:58.5610180Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.5610250Z Autotune Choices Stats: 2025-12-04T09:14:58.5611631Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.007968000136315823, "best_triton_pos": 0} 2025-12-04T09:14:58.5611883Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5612134Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5612487Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5613632Z triton_flex_attention_327 0.0080 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5614797Z triton_flex_attention_328 0.0082 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5615911Z triton_flex_attention_325 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.5617027Z triton_flex_attention_326 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.5618138Z triton_flex_attention_323 0.0113 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5619294Z triton_flex_attention_324 0.0133 ms 59.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5619568Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T09:14:58.5619637Z Autotune Choices Stats: 2025-12-04T09:14:58.5621095Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.5621541Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5621865Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5622457Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5623627Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5624782Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5625941Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5627124Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5628314Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5629502Z triton_flex_attention_backward_336 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5630649Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5631828Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5632980Z triton_flex_attention_backward_338 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5634137Z triton_flex_attention_backward_341 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.5634382Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3668 seconds precompiling for 13 choices 2025-12-04T09:14:58.5634508Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.5634584Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.5634645Z unimplemented [] 2025-12-04T09:14:58.5634749Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.5634932Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.5636141Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.5636238Z graph_break [] 2025-12-04T09:14:58.5636362Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.5636433Z Autotune Choices Stats: 2025-12-04T09:14:58.5637873Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.5638127Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5638343Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5638656Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5639830Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5640947Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5642061Z triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5643178Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.5644327Z triton_flex_attention_345 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.5645472Z triton_flex_attention_343 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5645714Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2820 seconds precompiling for 6 choices 2025-12-04T09:14:58.5645784Z Autotune Choices Stats: 2025-12-04T09:14:58.5647241Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_349", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.5647681Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5648039Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5648605Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5649767Z triton_flex_attention_backward_349 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5650928Z triton_flex_attention_backward_348 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5652079Z triton_flex_attention_backward_350 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5653353Z triton_flex_attention_backward_351 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5654506Z triton_flex_attention_backward_353 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5655687Z triton_flex_attention_backward_355 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5656837Z triton_flex_attention_backward_352 0.0143 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5658018Z triton_flex_attention_backward_354 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5659179Z triton_flex_attention_backward_357 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5660333Z triton_flex_attention_backward_360 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.5660575Z SingleProcess AUTOTUNE benchmarking takes 0.6637 seconds and 2.2959 seconds precompiling for 13 choices 2025-12-04T09:14:58.5660701Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.5660822Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.5660915Z unimplemented [] 2025-12-04T09:14:58.5661018Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.5661211Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.5662502Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.5662572Z graph_break [] 2025-12-04T09:14:58.5662702Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.5662772Z Autotune Choices Stats: 2025-12-04T09:14:58.5664222Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.5664475Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5664738Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5665051Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5666202Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5667318Z triton_flex_attention_366 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5668435Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.5669593Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5670773Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.5671922Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5672165Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3270 seconds precompiling for 6 choices 2025-12-04T09:14:58.5672234Z Autotune Choices Stats: 2025-12-04T09:14:58.5673658Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_368", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.5674144Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5674472Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5675032Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5676196Z triton_flex_attention_backward_368 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5677348Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5678535Z triton_flex_attention_backward_369 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5679715Z triton_flex_attention_backward_370 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5680899Z triton_flex_attention_backward_372 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5682044Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5683238Z triton_flex_attention_backward_374 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5684386Z triton_flex_attention_backward_371 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5685542Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5686725Z triton_flex_attention_backward_379 0.0146 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.5687002Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.4637 seconds precompiling for 13 choices 2025-12-04T09:14:58.5687130Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.5687202Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.5687264Z unimplemented [] 2025-12-04T09:14:58.5687364Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.5687550Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.5688729Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.5688829Z graph_break [] 2025-12-04T09:14:58.5688956Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.5689039Z Autotune Choices Stats: 2025-12-04T09:14:58.5690424Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_384", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.5690707Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5690925Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5691240Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5692408Z triton_flex_attention_384 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5693540Z triton_flex_attention_385 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5694659Z triton_flex_attention_380 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5695847Z triton_flex_attention_382 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.5696966Z triton_flex_attention_383 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.5698115Z triton_flex_attention_381 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5698355Z SingleProcess AUTOTUNE benchmarking takes 0.2918 seconds and 1.3242 seconds precompiling for 6 choices 2025-12-04T09:14:58.5698459Z Autotune Choices Stats: 2025-12-04T09:14:58.5699887Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_386", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.5700325Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5700650Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5701213Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5702380Z triton_flex_attention_backward_386 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5703598Z triton_flex_attention_backward_387 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5704818Z triton_flex_attention_backward_388 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5706009Z triton_flex_attention_backward_389 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5707166Z triton_flex_attention_backward_391 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5708346Z triton_flex_attention_backward_392 0.0134 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5709514Z triton_flex_attention_backward_390 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5710668Z triton_flex_attention_backward_393 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5711823Z triton_flex_attention_backward_395 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5713030Z triton_flex_attention_backward_394 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5713285Z SingleProcess AUTOTUNE benchmarking takes 0.6645 seconds and 2.2524 seconds precompiling for 13 choices 2025-12-04T09:14:58.5713413Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.5713496Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.5713559Z unimplemented [] 2025-12-04T09:14:58.5713660Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.5713852Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.5715066Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.5715133Z graph_break [] 2025-12-04T09:14:58.5715260Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.5715367Z Autotune Choices Stats: 2025-12-04T09:14:58.5716758Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_403", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.5717007Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5717223Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5717540Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5718676Z triton_flex_attention_403 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5719800Z triton_flex_attention_404 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5720970Z triton_flex_attention_399 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5722123Z triton_flex_attention_401 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.5723272Z triton_flex_attention_402 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.5724388Z triton_flex_attention_400 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5724664Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3374 seconds precompiling for 6 choices 2025-12-04T09:14:58.5724734Z Autotune Choices Stats: 2025-12-04T09:14:58.5726158Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_406", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.5726600Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5726929Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5727491Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5728693Z triton_flex_attention_backward_406 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5729889Z triton_flex_attention_backward_405 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5731039Z triton_flex_attention_backward_407 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5732217Z triton_flex_attention_backward_408 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5733469Z triton_flex_attention_backward_410 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5734624Z triton_flex_attention_backward_409 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5735779Z triton_flex_attention_backward_411 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5736928Z triton_flex_attention_backward_412 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5738114Z triton_flex_attention_backward_413 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5739302Z triton_flex_attention_backward_414 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5739552Z SingleProcess AUTOTUNE benchmarking takes 0.6644 seconds and 2.4031 seconds precompiling for 13 choices 2025-12-04T09:14:58.5739683Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.5739759Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.5739856Z unimplemented [] 2025-12-04T09:14:58.5739960Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.5740150Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.5741329Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.5741428Z graph_break [] 2025-12-04T09:14:58.5741558Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.5741623Z Autotune Choices Stats: 2025-12-04T09:14:58.5743008Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_422", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.5743253Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5743477Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5743794Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5744932Z triton_flex_attention_422 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5746078Z triton_flex_attention_423 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5747227Z triton_flex_attention_420 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.5748377Z triton_flex_attention_421 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.5749492Z triton_flex_attention_418 0.0123 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5750640Z triton_flex_attention_419 0.0142 ms 57.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5750883Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3073 seconds precompiling for 6 choices 2025-12-04T09:14:58.5750953Z Autotune Choices Stats: 2025-12-04T09:14:58.5752381Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_426", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012256000190973282, "best_triton_pos": 0} 2025-12-04T09:14:58.5752822Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5753147Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5753711Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5754908Z triton_flex_attention_backward_426 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5756096Z triton_flex_attention_backward_424 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5757283Z triton_flex_attention_backward_425 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5758436Z triton_flex_attention_backward_427 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5759619Z triton_flex_attention_backward_429 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5760777Z triton_flex_attention_backward_431 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5762062Z triton_flex_attention_backward_430 0.0141 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5763274Z triton_flex_attention_backward_428 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5764480Z triton_flex_attention_backward_433 0.0143 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5765689Z triton_flex_attention_backward_436 0.0144 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.5765939Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.3392 seconds precompiling for 13 choices 2025-12-04T09:14:58.5766068Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.5766141Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.5766201Z unimplemented [] 2025-12-04T09:14:58.5766302Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.5766550Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.5767722Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.5767791Z graph_break [] 2025-12-04T09:14:58.5767916Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.5767983Z Autotune Choices Stats: 2025-12-04T09:14:58.5769373Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_441", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.5769622Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5769843Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5770160Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5771329Z triton_flex_attention_441 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5772510Z triton_flex_attention_442 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5773627Z triton_flex_attention_439 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.5774780Z triton_flex_attention_437 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5775925Z triton_flex_attention_440 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.5777050Z triton_flex_attention_438 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5777291Z SingleProcess AUTOTUNE benchmarking takes 0.2906 seconds and 1.4377 seconds precompiling for 6 choices 2025-12-04T09:14:58.5777363Z Autotune Choices Stats: 2025-12-04T09:14:58.5778780Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_445", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.5779220Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5779546Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5780174Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5781337Z triton_flex_attention_backward_445 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5782526Z triton_flex_attention_backward_446 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5783675Z triton_flex_attention_backward_443 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5784861Z triton_flex_attention_backward_444 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5786005Z triton_flex_attention_backward_447 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5787163Z triton_flex_attention_backward_448 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5788318Z triton_flex_attention_backward_450 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5789540Z triton_flex_attention_backward_449 0.0135 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5790691Z triton_flex_attention_backward_452 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5791877Z triton_flex_attention_backward_455 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.5792125Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.3278 seconds precompiling for 13 choices 2025-12-04T09:14:58.5792287Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.5792359Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.5792420Z unimplemented [] 2025-12-04T09:14:58.5792520Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.5792707Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.5793881Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.5793948Z graph_break [] 2025-12-04T09:14:58.5794071Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.5794136Z Autotune Choices Stats: 2025-12-04T09:14:58.5795536Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_460", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:58.5795778Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5796000Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5796311Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5797524Z triton_flex_attention_460 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5798647Z triton_flex_attention_461 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5799815Z triton_flex_attention_459 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.5800925Z triton_flex_attention_456 0.0114 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5802077Z triton_flex_attention_458 0.0123 ms 75.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.5803199Z triton_flex_attention_457 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5803444Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.4513 seconds precompiling for 6 choices 2025-12-04T09:14:58.5803513Z Autotune Choices Stats: 2025-12-04T09:14:58.5804939Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_463", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.5805411Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5805769Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5806327Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5807525Z triton_flex_attention_backward_463 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5808679Z triton_flex_attention_backward_465 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5809862Z triton_flex_attention_backward_462 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5811011Z triton_flex_attention_backward_464 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5812161Z triton_flex_attention_backward_467 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5813369Z triton_flex_attention_backward_468 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5814565Z triton_flex_attention_backward_469 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5815742Z triton_flex_attention_backward_466 0.0133 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5816929Z triton_flex_attention_backward_471 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5818076Z triton_flex_attention_backward_470 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5818369Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.2662 seconds precompiling for 13 choices 2025-12-04T09:14:58.5818503Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.5818580Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.5818642Z unimplemented [] 2025-12-04T09:14:58.5818755Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.5818947Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.5820125Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.5820197Z graph_break [] 2025-12-04T09:14:58.5820327Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.5820392Z Autotune Choices Stats: 2025-12-04T09:14:58.5821798Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_479", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.007968000136315823, "best_triton_pos": 0} 2025-12-04T09:14:58.5822042Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5822338Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5822656Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5823788Z triton_flex_attention_479 0.0080 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5824941Z triton_flex_attention_480 0.0082 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5826065Z triton_flex_attention_477 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.5827213Z triton_flex_attention_478 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.5828345Z triton_flex_attention_475 0.0113 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5829472Z triton_flex_attention_476 0.0133 ms 59.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5829715Z SingleProcess AUTOTUNE benchmarking takes 0.2917 seconds and 1.3266 seconds precompiling for 6 choices 2025-12-04T09:14:58.5829782Z Autotune Choices Stats: 2025-12-04T09:14:58.5831254Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_481", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.5831725Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5832048Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5832607Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5833811Z triton_flex_attention_backward_481 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5834967Z triton_flex_attention_backward_482 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5836152Z triton_flex_attention_backward_483 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5837308Z triton_flex_attention_backward_484 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5838458Z triton_flex_attention_backward_485 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5839688Z triton_flex_attention_backward_486 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5840878Z triton_flex_attention_backward_487 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5842064Z triton_flex_attention_backward_488 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5843229Z triton_flex_attention_backward_490 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5844409Z triton_flex_attention_backward_493 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.5844666Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2475 seconds precompiling for 13 choices 2025-12-04T09:14:58.5844794Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.5844871Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.5844936Z unimplemented [] 2025-12-04T09:14:58.5845042Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.5845234Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.5846405Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.5846471Z graph_break [] 2025-12-04T09:14:58.5846596Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.5846674Z Autotune Choices Stats: 2025-12-04T09:14:58.5848105Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_498", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.5848382Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5848608Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5848929Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5850093Z triton_flex_attention_498 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5851207Z triton_flex_attention_499 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5852404Z triton_flex_attention_494 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5853517Z triton_flex_attention_496 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.5854636Z triton_flex_attention_497 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.5855766Z triton_flex_attention_495 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5856013Z SingleProcess AUTOTUNE benchmarking takes 0.2916 seconds and 1.3152 seconds precompiling for 6 choices 2025-12-04T09:14:58.5856115Z Autotune Choices Stats: 2025-12-04T09:14:58.5857580Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_500", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.5858017Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5858371Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5858931Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5860105Z triton_flex_attention_backward_500 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5861316Z triton_flex_attention_backward_501 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5862804Z triton_flex_attention_backward_502 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5863986Z triton_flex_attention_backward_503 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5865212Z triton_flex_attention_backward_505 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5866434Z triton_flex_attention_backward_507 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5867584Z triton_flex_attention_backward_506 0.0143 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5868778Z triton_flex_attention_backward_504 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5869977Z triton_flex_attention_backward_509 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5871131Z triton_flex_attention_backward_512 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.5871399Z SingleProcess AUTOTUNE benchmarking takes 0.6641 seconds and 2.2733 seconds precompiling for 13 choices 2025-12-04T09:14:58.5871530Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.5871608Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.5871673Z unimplemented [] 2025-12-04T09:14:58.5871777Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.5871968Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.5873147Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 27), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 8), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 5), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.5873212Z graph_break [] 2025-12-04T09:14:58.5873342Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.5873447Z Autotune Choices Stats: 2025-12-04T09:14:58.5874885Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_517", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.5875133Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5875356Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5875676Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5876844Z triton_flex_attention_517 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5877961Z triton_flex_attention_518 0.0084 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5879120Z triton_flex_attention_513 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5880240Z triton_flex_attention_515 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.5881357Z triton_flex_attention_516 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.5882510Z triton_flex_attention_514 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5882785Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3311 seconds precompiling for 6 choices 2025-12-04T09:14:58.5882852Z Autotune Choices Stats: 2025-12-04T09:14:58.5884277Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_519", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.5884745Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5885077Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5885635Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5886842Z triton_flex_attention_backward_519 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5887999Z triton_flex_attention_backward_520 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5889150Z triton_flex_attention_backward_521 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5890305Z triton_flex_attention_backward_522 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5891487Z triton_flex_attention_backward_524 0.0132 ms 93.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5892725Z triton_flex_attention_backward_523 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5893938Z triton_flex_attention_backward_525 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5895099Z triton_flex_attention_backward_526 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5896290Z triton_flex_attention_backward_528 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5897444Z triton_flex_attention_backward_531 0.0153 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.5897696Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.2909 seconds precompiling for 13 choices 2025-12-04T09:14:58.5897824Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.5897897Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.5897960Z unimplemented [] 2025-12-04T09:14:58.5898062Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.5898250Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.5899462Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.5899582Z graph_break [] 2025-12-04T09:14:58.5899711Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.5899778Z Autotune Choices Stats: 2025-12-04T09:14:58.5901167Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_536", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.5901447Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5901669Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5901981Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5903113Z triton_flex_attention_536 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5904260Z triton_flex_attention_537 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5905373Z triton_flex_attention_532 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5906489Z triton_flex_attention_534 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.5907644Z triton_flex_attention_535 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.5908810Z triton_flex_attention_533 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5909062Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.2928 seconds precompiling for 6 choices 2025-12-04T09:14:58.5909129Z Autotune Choices Stats: 2025-12-04T09:14:58.5910591Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_538", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:14:58.5911024Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5911362Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5911959Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5913128Z triton_flex_attention_backward_538 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5914283Z triton_flex_attention_backward_539 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5915436Z triton_flex_attention_backward_540 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5916620Z triton_flex_attention_backward_541 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5917802Z triton_flex_attention_backward_542 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5918999Z triton_flex_attention_backward_543 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5920144Z triton_flex_attention_backward_545 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5921329Z triton_flex_attention_backward_544 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5922482Z triton_flex_attention_backward_547 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5923630Z triton_flex_attention_backward_550 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.5923876Z SingleProcess AUTOTUNE benchmarking takes 0.6681 seconds and 2.3456 seconds precompiling for 13 choices 2025-12-04T09:14:58.5924005Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.5924072Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.5924139Z unimplemented [] 2025-12-04T09:14:58.5924240Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.5924464Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.5925695Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.5925764Z graph_break [] 2025-12-04T09:14:58.5925890Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.5925958Z Autotune Choices Stats: 2025-12-04T09:14:58.5927375Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_555", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:58.5927618Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5927837Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5928182Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5929326Z triton_flex_attention_555 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5930438Z triton_flex_attention_556 0.0112 ms 91.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5931558Z triton_flex_attention_553 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.5932721Z triton_flex_attention_554 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.5933880Z triton_flex_attention_551 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5935028Z triton_flex_attention_552 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5935273Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.3230 seconds precompiling for 6 choices 2025-12-04T09:14:58.5935339Z Autotune Choices Stats: 2025-12-04T09:14:58.5936793Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_559", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:58.5937258Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5937593Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5938154Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5939316Z triton_flex_attention_backward_559 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5940471Z triton_flex_attention_backward_560 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5941660Z triton_flex_attention_backward_558 0.0142 ms 93.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5942839Z triton_flex_attention_backward_557 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5944019Z triton_flex_attention_backward_562 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5945174Z triton_flex_attention_backward_561 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5946347Z triton_flex_attention_backward_563 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5947494Z triton_flex_attention_backward_564 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5948655Z triton_flex_attention_backward_569 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.5949803Z triton_flex_attention_backward_566 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5950054Z SingleProcess AUTOTUNE benchmarking takes 0.6659 seconds and 2.2862 seconds precompiling for 13 choices 2025-12-04T09:14:58.5950292Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:14:58.5950374Z Traceback (most recent call last): 2025-12-04T09:14:58.5950684Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:14:58.5950750Z self.assertTrue( 2025-12-04T09:14:58.5950966Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:14:58.5951048Z raise self.failureException(msg) 2025-12-04T09:14:58.5951287Z AssertionError: False is not true : Log file /tmp/tmp3f354stf/flex_attention_configs.json was not created 2025-12-04T09:14:58.5951293Z 2025-12-04T09:14:58.5951435Z To execute this test, run the following from the base repo dir: 2025-12-04T09:14:58.5951692Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:14:58.5951698Z 2025-12-04T09:14:58.5951874Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:14:58.5952040Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.5952112Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.5952183Z unimplemented [] 2025-12-04T09:14:58.5952289Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.5953485Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('async_compile_cache_miss', 23), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:14:58.5953712Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.5953775Z graph_break [] 2025-12-04T09:14:58.5953912Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.5954903Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:14:58.5954993Z current_size = base.storage().size() 2025-12-04T09:14:58.5955064Z Autotune Choices Stats: 2025-12-04T09:14:58.5956466Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:58.5956722Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5956942Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5957276Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5958465Z triton_flex_attention_4 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5959613Z triton_flex_attention_5 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5960756Z triton_flex_attention_0 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5961992Z triton_flex_attention_2 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.5963180Z triton_flex_attention_3 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.5964300Z triton_flex_attention_1 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5964552Z SingleProcess AUTOTUNE benchmarking takes 0.2600 seconds and 1.4244 seconds precompiling for 6 choices 2025-12-04T09:14:58.5964620Z Autotune Choices Stats: 2025-12-04T09:14:58.5966057Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:58.5966491Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5966867Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5967480Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5968650Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5969846Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5970994Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5972223Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5973419Z triton_flex_attention_backward_11 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5974577Z triton_flex_attention_backward_12 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5975764Z triton_flex_attention_backward_13 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5976939Z triton_flex_attention_backward_10 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5978116Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5979277Z triton_flex_attention_backward_15 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5979554Z SingleProcess AUTOTUNE benchmarking takes 0.8717 seconds and 2.2315 seconds precompiling for 13 choices 2025-12-04T09:14:58.5979691Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.5979774Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.5979841Z unimplemented [] 2025-12-04T09:14:58.5979947Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.5980133Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.5981315Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.5981376Z graph_break [] 2025-12-04T09:14:58.5981522Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.5981595Z Autotune Choices Stats: 2025-12-04T09:14:58.5982976Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.5983228Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5983443Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5983832Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5984960Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5986103Z triton_flex_attention_24 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5987207Z triton_flex_attention_19 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5988371Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.5989490Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.5990602Z triton_flex_attention_20 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5990852Z SingleProcess AUTOTUNE benchmarking takes 0.2898 seconds and 1.2778 seconds precompiling for 6 choices 2025-12-04T09:14:58.5990918Z Autotune Choices Stats: 2025-12-04T09:14:58.5992348Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.5992850Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.5993180Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.5993739Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.5994943Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.5996093Z triton_flex_attention_backward_27 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.5997279Z triton_flex_attention_backward_28 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.5998426Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.5999584Z triton_flex_attention_backward_30 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6000734Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6001941Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6003103Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6004284Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6005439Z triton_flex_attention_backward_37 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.6005832Z SingleProcess AUTOTUNE benchmarking takes 0.6605 seconds and 2.3199 seconds precompiling for 13 choices 2025-12-04T09:14:58.6005971Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.6006040Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.6006102Z unimplemented [] 2025-12-04T09:14:58.6006209Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.6006393Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.6007572Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.6007634Z graph_break [] 2025-12-04T09:14:58.6007768Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.6007837Z Autotune Choices Stats: 2025-12-04T09:14:58.6009219Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.6009541Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6009762Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6010093Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6011216Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6012404Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6013531Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.6014679Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.6015790Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6016909Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6017161Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2616 seconds precompiling for 6 choices 2025-12-04T09:14:58.6017230Z Autotune Choices Stats: 2025-12-04T09:14:58.6018705Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.6019176Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6019506Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6020116Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6021276Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6022461Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6023618Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6024781Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6025934Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6027114Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6028288Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6029476Z triton_flex_attention_backward_50 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6030621Z triton_flex_attention_backward_53 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6031812Z triton_flex_attention_backward_56 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.6032056Z SingleProcess AUTOTUNE benchmarking takes 0.6631 seconds and 2.2635 seconds precompiling for 13 choices 2025-12-04T09:14:58.6032196Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.6032265Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.6032326Z unimplemented [] 2025-12-04T09:14:58.6032435Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.6032616Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.6033794Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.6033854Z graph_break [] 2025-12-04T09:14:58.6033979Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.6034051Z Autotune Choices Stats: 2025-12-04T09:14:58.6035471Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.6035752Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6035968Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6036286Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6037439Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6038551Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6039701Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.6040817Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.6041931Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6043038Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6043318Z SingleProcess AUTOTUNE benchmarking takes 0.2907 seconds and 1.2500 seconds precompiling for 6 choices 2025-12-04T09:14:58.6043414Z Autotune Choices Stats: 2025-12-04T09:14:58.6044856Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.6045293Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6045656Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6046222Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6047382Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6048579Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6049740Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6050890Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6052073Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6053302Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6054485Z triton_flex_attention_backward_69 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6055636Z triton_flex_attention_backward_70 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6056815Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6057969Z triton_flex_attention_backward_75 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.6058214Z SingleProcess AUTOTUNE benchmarking takes 0.6601 seconds and 2.1469 seconds precompiling for 13 choices 2025-12-04T09:14:58.6058349Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.6058420Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.6058483Z unimplemented [] 2025-12-04T09:14:58.6058604Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.6058787Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.6059975Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.6060036Z graph_break [] 2025-12-04T09:14:58.6060231Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.6060304Z Autotune Choices Stats: 2025-12-04T09:14:58.6061693Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.6062055Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6062274Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6062654Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6063787Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6064961Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6066076Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.6067194Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.6068301Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6069470Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6069765Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2576 seconds precompiling for 6 choices 2025-12-04T09:14:58.6069831Z Autotune Choices Stats: 2025-12-04T09:14:58.6071258Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_84", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.6071737Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6072069Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6072638Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6073834Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6074992Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6076151Z triton_flex_attention_backward_83 0.0122 ms 92.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6077296Z triton_flex_attention_backward_82 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6078519Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6079674Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6080859Z triton_flex_attention_backward_88 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6082012Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6083190Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6084341Z triton_flex_attention_backward_91 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6084583Z SingleProcess AUTOTUNE benchmarking takes 0.6599 seconds and 2.2799 seconds precompiling for 13 choices 2025-12-04T09:14:58.6084716Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.6084784Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.6084845Z unimplemented [] 2025-12-04T09:14:58.6084952Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.6085134Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.6086364Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.6086459Z graph_break [] 2025-12-04T09:14:58.6086585Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.6086660Z Autotune Choices Stats: 2025-12-04T09:14:58.6088043Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:58.6088325Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6088541Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6088858Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6089981Z triton_flex_attention_99 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6091137Z triton_flex_attention_100 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6092246Z triton_flex_attention_95 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6093424Z triton_flex_attention_98 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.6094580Z triton_flex_attention_97 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.6095735Z triton_flex_attention_96 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6095982Z SingleProcess AUTOTUNE benchmarking takes 0.2905 seconds and 1.2994 seconds precompiling for 6 choices 2025-12-04T09:14:58.6096049Z Autotune Choices Stats: 2025-12-04T09:14:58.6097503Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_102", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.6097942Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6098296Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6098862Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6100027Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6101190Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6102347Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6103529Z triton_flex_attention_backward_101 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6104711Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6105933Z triton_flex_attention_backward_107 0.0133 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6107080Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6108274Z triton_flex_attention_backward_108 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6109436Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6110593Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6110835Z SingleProcess AUTOTUNE benchmarking takes 0.6615 seconds and 2.2434 seconds precompiling for 13 choices 2025-12-04T09:14:58.6110969Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.6111037Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.6111098Z unimplemented [] 2025-12-04T09:14:58.6111202Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.6111453Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.6112630Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.6112691Z graph_break [] 2025-12-04T09:14:58.6112817Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.6112890Z Autotune Choices Stats: 2025-12-04T09:14:58.6114304Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.6114552Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6114767Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6115139Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6116271Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6117400Z triton_flex_attention_119 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6118523Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.6119648Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.6120819Z triton_flex_attention_114 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6121935Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6122217Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2587 seconds precompiling for 6 choices 2025-12-04T09:14:58.6122284Z Autotune Choices Stats: 2025-12-04T09:14:58.6123699Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.6124170Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6124493Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6125053Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6126214Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6127373Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6128561Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6129749Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6130937Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6132084Z triton_flex_attention_backward_124 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6133332Z triton_flex_attention_backward_126 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6134482Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6135627Z triton_flex_attention_backward_129 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6136778Z triton_flex_attention_backward_132 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.6137090Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.3330 seconds precompiling for 13 choices 2025-12-04T09:14:58.6137225Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.6137293Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.6137354Z unimplemented [] 2025-12-04T09:14:58.6137462Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.6137646Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.6138825Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.6138899Z graph_break [] 2025-12-04T09:14:58.6139070Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.6139146Z Autotune Choices Stats: 2025-12-04T09:14:58.6140541Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:58.6140824Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6141048Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6141370Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6142493Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6143605Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6144711Z triton_flex_attention_136 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.6145853Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.6146985Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6148140Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6148391Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3488 seconds precompiling for 6 choices 2025-12-04T09:14:58.6148461Z Autotune Choices Stats: 2025-12-04T09:14:58.6149884Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_140", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.6150358Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6150679Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6151245Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6152407Z triton_flex_attention_backward_140 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6153587Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6154762Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6155899Z triton_flex_attention_backward_142 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6157078Z triton_flex_attention_backward_144 0.0131 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6158248Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6159395Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6160540Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6161685Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6163048Z triton_flex_attention_backward_149 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6163349Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.2392 seconds precompiling for 13 choices 2025-12-04T09:14:58.6163485Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.6163554Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.6163616Z unimplemented [] 2025-12-04T09:14:58.6163728Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.6163913Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.6165150Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.6165213Z graph_break [] 2025-12-04T09:14:58.6165341Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.6165411Z Autotune Choices Stats: 2025-12-04T09:14:58.6166802Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.6167099Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6167317Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6167635Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6168759Z triton_flex_attention_156 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6169865Z triton_flex_attention_157 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6170999Z triton_flex_attention_152 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6172139Z triton_flex_attention_154 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.6173354Z triton_flex_attention_155 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.6174478Z triton_flex_attention_153 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6174751Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3309 seconds precompiling for 6 choices 2025-12-04T09:14:58.6174820Z Autotune Choices Stats: 2025-12-04T09:14:58.6176247Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_160", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.6176684Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6177012Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6177584Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6178757Z triton_flex_attention_backward_160 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6179948Z triton_flex_attention_backward_158 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6181150Z triton_flex_attention_backward_159 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6182337Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6183486Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6184668Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6185818Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6186966Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6188163Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6189358Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.6189602Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.1546 seconds precompiling for 13 choices 2025-12-04T09:14:58.6189737Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.6189811Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.6189875Z unimplemented [] 2025-12-04T09:14:58.6189985Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.6190201Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.6191380Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.6191477Z graph_break [] 2025-12-04T09:14:58.6191602Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.6191672Z Autotune Choices Stats: 2025-12-04T09:14:58.6193069Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.6193321Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6193539Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6193860Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6194985Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6196136Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6204276Z triton_flex_attention_171 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6205417Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.6206585Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.6207762Z triton_flex_attention_172 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6208018Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3199 seconds precompiling for 6 choices 2025-12-04T09:14:58.6208097Z Autotune Choices Stats: 2025-12-04T09:14:58.6209543Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_178", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.6209998Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6210329Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6210896Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6212123Z triton_flex_attention_backward_178 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6213387Z triton_flex_attention_backward_177 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6214573Z triton_flex_attention_backward_180 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6215720Z triton_flex_attention_backward_179 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6216897Z triton_flex_attention_backward_182 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6218034Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6219180Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6220313Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6221517Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6222662Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.6222950Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3449 seconds precompiling for 13 choices 2025-12-04T09:14:58.6223089Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.6223167Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.6223229Z unimplemented [] 2025-12-04T09:14:58.6223333Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.6223526Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.6224744Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.6224812Z graph_break [] 2025-12-04T09:14:58.6224941Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.6225016Z Autotune Choices Stats: 2025-12-04T09:14:58.6226412Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:58.6226669Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6226889Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6227212Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6228360Z triton_flex_attention_194 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6229562Z triton_flex_attention_195 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6230689Z triton_flex_attention_192 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.6231841Z triton_flex_attention_193 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.6232943Z triton_flex_attention_190 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6234089Z triton_flex_attention_191 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6234335Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3094 seconds precompiling for 6 choices 2025-12-04T09:14:58.6234412Z Autotune Choices Stats: 2025-12-04T09:14:58.6235836Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.6236273Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6236601Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6237202Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6238397Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6239582Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6240728Z triton_flex_attention_backward_197 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6241923Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6243076Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6244225Z triton_flex_attention_backward_200 0.0124 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6245373Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6246547Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6247723Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.6248895Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6249148Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.3930 seconds precompiling for 13 choices 2025-12-04T09:14:58.6249280Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.6249390Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.6249456Z unimplemented [] 2025-12-04T09:14:58.6249560Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.6249751Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.6250943Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.6251009Z graph_break [] 2025-12-04T09:14:58.6251138Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.6251211Z Autotune Choices Stats: 2025-12-04T09:14:58.6252642Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008063999935984612, "best_triton_pos": 0} 2025-12-04T09:14:58.6252892Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6253110Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6253424Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6254598Z triton_flex_attention_213 0.0081 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6255740Z triton_flex_attention_214 0.0082 ms 98.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6256888Z triton_flex_attention_211 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.6257998Z triton_flex_attention_212 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.6259144Z triton_flex_attention_209 0.0112 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6260252Z triton_flex_attention_210 0.0133 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6260501Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.3148 seconds precompiling for 6 choices 2025-12-04T09:14:58.6260573Z Autotune Choices Stats: 2025-12-04T09:14:58.6262216Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.6262669Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6263122Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6263700Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6264866Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6266071Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6267218Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6268418Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6269568Z triton_flex_attention_backward_219 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6270708Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6271893Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6273076Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6274276Z triton_flex_attention_backward_224 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6275421Z triton_flex_attention_backward_227 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.6275702Z SingleProcess AUTOTUNE benchmarking takes 0.6619 seconds and 2.3746 seconds precompiling for 13 choices 2025-12-04T09:14:58.6275837Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.6275928Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.6275990Z unimplemented [] 2025-12-04T09:14:58.6276095Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.6276283Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.6277457Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.6277524Z graph_break [] 2025-12-04T09:14:58.6277654Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.6277719Z Autotune Choices Stats: 2025-12-04T09:14:58.6279123Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.6279369Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6279621Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6279969Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6281104Z triton_flex_attention_232 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6282247Z triton_flex_attention_233 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6283362Z triton_flex_attention_228 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6284510Z triton_flex_attention_230 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.6285625Z triton_flex_attention_231 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.6286739Z triton_flex_attention_229 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6287001Z SingleProcess AUTOTUNE benchmarking takes 0.2900 seconds and 1.2906 seconds precompiling for 6 choices 2025-12-04T09:14:58.6287072Z Autotune Choices Stats: 2025-12-04T09:14:58.6288536Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.6289008Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6289334Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6289894Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6291093Z triton_flex_attention_backward_234 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6292250Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6293497Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6294655Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6295811Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6297001Z triton_flex_attention_backward_240 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6298198Z triton_flex_attention_backward_241 0.0134 ms 91.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6299384Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6300540Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6301717Z triton_flex_attention_backward_246 0.0153 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.6301969Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2768 seconds precompiling for 13 choices 2025-12-04T09:14:58.6302097Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.6302170Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.6302232Z unimplemented [] 2025-12-04T09:14:58.6302334Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.6302522Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.6303700Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.6303765Z graph_break [] 2025-12-04T09:14:58.6303890Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.6303955Z Autotune Choices Stats: 2025-12-04T09:14:58.6305384Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.6305673Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6305897Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6306209Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6307349Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6308518Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6309675Z triton_flex_attention_250 0.0111 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.6310792Z triton_flex_attention_247 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6311907Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.6313027Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6313273Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2952 seconds precompiling for 6 choices 2025-12-04T09:14:58.6313343Z Autotune Choices Stats: 2025-12-04T09:14:58.6314802Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_255", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.6315315Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6315637Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6316239Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6317395Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6318579Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6319739Z triton_flex_attention_backward_254 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6320894Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6322050Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6323275Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6324429Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6325608Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6326757Z triton_flex_attention_backward_262 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6327939Z triton_flex_attention_backward_265 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.6328189Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.3139 seconds precompiling for 13 choices 2025-12-04T09:14:58.6328316Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.6328389Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.6328454Z unimplemented [] 2025-12-04T09:14:58.6328555Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.6328745Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.6329914Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.6329993Z graph_break [] 2025-12-04T09:14:58.6330121Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.6330187Z Autotune Choices Stats: 2025-12-04T09:14:58.6331623Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.6331893Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6332115Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6332471Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6333642Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6334759Z triton_flex_attention_271 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6335916Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6337035Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.6338149Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.6339324Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6339599Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T09:14:58.6339672Z Autotune Choices Stats: 2025-12-04T09:14:58.6341091Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.6341531Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6341888Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6342456Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6343654Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6344811Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6345966Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6347117Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6348303Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6349486Z triton_flex_attention_backward_278 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6350674Z triton_flex_attention_backward_279 0.0136 ms 90.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6351814Z triton_flex_attention_backward_276 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6353009Z triton_flex_attention_backward_281 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6354157Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.6354409Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.2940 seconds precompiling for 13 choices 2025-12-04T09:14:58.6354538Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.6354609Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.6354668Z unimplemented [] 2025-12-04T09:14:58.6354768Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.6354955Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.6356161Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.6356259Z graph_break [] 2025-12-04T09:14:58.6356385Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.6356449Z Autotune Choices Stats: 2025-12-04T09:14:58.6357834Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.6358075Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6358331Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6358642Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6359781Z triton_flex_attention_289 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6360931Z triton_flex_attention_290 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6362172Z triton_flex_attention_287 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.6363292Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6364410Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.6365633Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6365877Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3003 seconds precompiling for 6 choices 2025-12-04T09:14:58.6365947Z Autotune Choices Stats: 2025-12-04T09:14:58.6367421Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_294", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.012160000391304493, "best_triton_pos": 0} 2025-12-04T09:14:58.6367862Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6368186Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6368801Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6369969Z triton_flex_attention_backward_294 0.0122 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6371129Z triton_flex_attention_backward_291 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6372279Z triton_flex_attention_backward_292 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6373511Z triton_flex_attention_backward_293 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6374687Z triton_flex_attention_backward_296 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6375868Z triton_flex_attention_backward_295 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6377018Z triton_flex_attention_backward_297 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6378201Z triton_flex_attention_backward_298 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6379357Z triton_flex_attention_backward_300 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6380510Z triton_flex_attention_backward_303 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.6380760Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.2523 seconds precompiling for 13 choices 2025-12-04T09:14:58.6380886Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.6380958Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.6381020Z unimplemented [] 2025-12-04T09:14:58.6381121Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.6381309Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.6382514Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.6382610Z graph_break [] 2025-12-04T09:14:58.6382737Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.6382800Z Autotune Choices Stats: 2025-12-04T09:14:58.6384225Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_308", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.6384467Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6384686Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6385000Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6386164Z triton_flex_attention_308 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6387285Z triton_flex_attention_309 0.0082 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6388417Z triton_flex_attention_307 0.0112 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.6389532Z triton_flex_attention_304 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6390685Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.6391829Z triton_flex_attention_305 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6392069Z SingleProcess AUTOTUNE benchmarking takes 0.2924 seconds and 1.4096 seconds precompiling for 6 choices 2025-12-04T09:14:58.6392136Z Autotune Choices Stats: 2025-12-04T09:14:58.6393596Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.6394065Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6394390Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6394953Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6396121Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6397280Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6398428Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6399657Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6400812Z triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6402016Z triton_flex_attention_backward_316 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6403184Z triton_flex_attention_backward_317 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6404369Z triton_flex_attention_backward_314 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6405528Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6406674Z triton_flex_attention_backward_318 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6406925Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.3225 seconds precompiling for 13 choices 2025-12-04T09:14:58.6407049Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.6407186Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.6407249Z unimplemented [] 2025-12-04T09:14:58.6407347Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.6407532Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.6408697Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.6408770Z graph_break [] 2025-12-04T09:14:58.6408899Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.6408967Z Autotune Choices Stats: 2025-12-04T09:14:58.6410393Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.007968000136315823, "best_triton_pos": 0} 2025-12-04T09:14:58.6410665Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6410884Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6411201Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6412367Z triton_flex_attention_327 0.0080 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6413495Z triton_flex_attention_328 0.0082 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6414614Z triton_flex_attention_325 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.6415761Z triton_flex_attention_326 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.6416905Z triton_flex_attention_323 0.0113 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6418057Z triton_flex_attention_324 0.0133 ms 59.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6418300Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T09:14:58.6418364Z Autotune Choices Stats: 2025-12-04T09:14:58.6419792Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.6420261Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6420586Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6421139Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6422305Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6423461Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6424642Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6425822Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6427006Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6428162Z triton_flex_attention_backward_336 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6429360Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6430502Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6431656Z triton_flex_attention_backward_338 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6432833Z triton_flex_attention_backward_341 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.6433133Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3668 seconds precompiling for 13 choices 2025-12-04T09:14:58.6433255Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.6433327Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.6433388Z unimplemented [] 2025-12-04T09:14:58.6433486Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.6433671Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.6434892Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.6434959Z graph_break [] 2025-12-04T09:14:58.6435085Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.6435151Z Autotune Choices Stats: 2025-12-04T09:14:58.6436534Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.6436809Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6437030Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6437359Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6438489Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6439604Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6440755Z triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6441897Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.6443015Z triton_flex_attention_345 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.6444170Z triton_flex_attention_343 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6444413Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2820 seconds precompiling for 6 choices 2025-12-04T09:14:58.6444509Z Autotune Choices Stats: 2025-12-04T09:14:58.6445950Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_349", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.6446381Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6446711Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6447273Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6448442Z triton_flex_attention_backward_349 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6449677Z triton_flex_attention_backward_348 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6450864Z triton_flex_attention_backward_350 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6452050Z triton_flex_attention_backward_351 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6453235Z triton_flex_attention_backward_353 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6454630Z triton_flex_attention_backward_355 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6455781Z triton_flex_attention_backward_352 0.0143 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6456937Z triton_flex_attention_backward_354 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6458082Z triton_flex_attention_backward_357 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6459305Z triton_flex_attention_backward_360 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.6459554Z SingleProcess AUTOTUNE benchmarking takes 0.6637 seconds and 2.2959 seconds precompiling for 13 choices 2025-12-04T09:14:58.6459683Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.6459754Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.6459815Z unimplemented [] 2025-12-04T09:14:58.6459915Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.6460107Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.6461301Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.6461368Z graph_break [] 2025-12-04T09:14:58.6461490Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.6461585Z Autotune Choices Stats: 2025-12-04T09:14:58.6463097Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.6463340Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6463559Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6463874Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6465004Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6466112Z triton_flex_attention_366 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6467307Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.6468477Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6469650Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.6470771Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6471070Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3270 seconds precompiling for 6 choices 2025-12-04T09:14:58.6471137Z Autotune Choices Stats: 2025-12-04T09:14:58.6472564Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_368", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.6472995Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6473327Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6473882Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6475074Z triton_flex_attention_backward_368 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6476254Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6477435Z triton_flex_attention_backward_369 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6478581Z triton_flex_attention_backward_370 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6479764Z triton_flex_attention_backward_372 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6480929Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6482080Z triton_flex_attention_backward_374 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6483233Z triton_flex_attention_backward_371 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6484416Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6485588Z triton_flex_attention_backward_379 0.0146 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.6485837Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.4637 seconds precompiling for 13 choices 2025-12-04T09:14:58.6485965Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.6486062Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.6486128Z unimplemented [] 2025-12-04T09:14:58.6486229Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.6486413Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.6487583Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.6487679Z graph_break [] 2025-12-04T09:14:58.6487805Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.6487870Z Autotune Choices Stats: 2025-12-04T09:14:58.6489258Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_384", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.6489503Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6489724Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6490038Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6491169Z triton_flex_attention_384 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6492367Z triton_flex_attention_385 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6493542Z triton_flex_attention_380 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6494685Z triton_flex_attention_382 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.6495804Z triton_flex_attention_383 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.6496971Z triton_flex_attention_381 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6497211Z SingleProcess AUTOTUNE benchmarking takes 0.2918 seconds and 1.3242 seconds precompiling for 6 choices 2025-12-04T09:14:58.6497276Z Autotune Choices Stats: 2025-12-04T09:14:58.6498704Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_386", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.6499145Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6499474Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6500034Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6501267Z triton_flex_attention_backward_386 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6502417Z triton_flex_attention_backward_387 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6503608Z triton_flex_attention_backward_388 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6504764Z triton_flex_attention_backward_389 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6505948Z triton_flex_attention_backward_391 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6507106Z triton_flex_attention_backward_392 0.0134 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6508255Z triton_flex_attention_backward_390 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6509438Z triton_flex_attention_backward_393 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6510631Z triton_flex_attention_backward_395 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6511817Z triton_flex_attention_backward_394 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6512067Z SingleProcess AUTOTUNE benchmarking takes 0.6645 seconds and 2.2524 seconds precompiling for 13 choices 2025-12-04T09:14:58.6512193Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.6512261Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.6512328Z unimplemented [] 2025-12-04T09:14:58.6512429Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.6512648Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.6513815Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.6513875Z graph_break [] 2025-12-04T09:14:58.6514022Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.6514088Z Autotune Choices Stats: 2025-12-04T09:14:58.6515477Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_403", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.6515724Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6515947Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6516264Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6517433Z triton_flex_attention_403 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6518577Z triton_flex_attention_404 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6519727Z triton_flex_attention_399 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6520838Z triton_flex_attention_401 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.6521992Z triton_flex_attention_402 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.6523103Z triton_flex_attention_400 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6523352Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3374 seconds precompiling for 6 choices 2025-12-04T09:14:58.6523415Z Autotune Choices Stats: 2025-12-04T09:14:58.6524846Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_406", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.6525277Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6525638Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6526226Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6527392Z triton_flex_attention_backward_406 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6528594Z triton_flex_attention_backward_405 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6529746Z triton_flex_attention_backward_407 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6530928Z triton_flex_attention_backward_408 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6532074Z triton_flex_attention_backward_410 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6533276Z triton_flex_attention_backward_409 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6534460Z triton_flex_attention_backward_411 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6535646Z triton_flex_attention_backward_412 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6536796Z triton_flex_attention_backward_413 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6537974Z triton_flex_attention_backward_414 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6538254Z SingleProcess AUTOTUNE benchmarking takes 0.6644 seconds and 2.4031 seconds precompiling for 13 choices 2025-12-04T09:14:58.6538379Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.6538445Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.6538512Z unimplemented [] 2025-12-04T09:14:58.6538611Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.6538798Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.6539979Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.6540039Z graph_break [] 2025-12-04T09:14:58.6540172Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.6540234Z Autotune Choices Stats: 2025-12-04T09:14:58.6541618Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_422", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.6541858Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6542079Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6542390Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6543586Z triton_flex_attention_422 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6544696Z triton_flex_attention_423 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6545852Z triton_flex_attention_420 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.6546968Z triton_flex_attention_421 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.6548109Z triton_flex_attention_418 0.0123 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6549230Z triton_flex_attention_419 0.0142 ms 57.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6549476Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3073 seconds precompiling for 6 choices 2025-12-04T09:14:58.6549539Z Autotune Choices Stats: 2025-12-04T09:14:58.6550969Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_426", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012256000190973282, "best_triton_pos": 0} 2025-12-04T09:14:58.6551463Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6551790Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6552345Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6553542Z triton_flex_attention_backward_426 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6554692Z triton_flex_attention_backward_424 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6555881Z triton_flex_attention_backward_425 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6557034Z triton_flex_attention_backward_427 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6558179Z triton_flex_attention_backward_429 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6559339Z triton_flex_attention_backward_431 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6560523Z triton_flex_attention_backward_430 0.0141 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6561724Z triton_flex_attention_backward_428 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6563070Z triton_flex_attention_backward_433 0.0143 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6564224Z triton_flex_attention_backward_436 0.0144 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.6564531Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.3392 seconds precompiling for 13 choices 2025-12-04T09:14:58.6564667Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.6564735Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.6564801Z unimplemented [] 2025-12-04T09:14:58.6564903Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.6565083Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.6566259Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.6566319Z graph_break [] 2025-12-04T09:14:58.6566453Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.6566517Z Autotune Choices Stats: 2025-12-04T09:14:58.6567905Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_441", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.6568247Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6568469Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6568793Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6569917Z triton_flex_attention_441 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6571062Z triton_flex_attention_442 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6572189Z triton_flex_attention_439 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.6573373Z triton_flex_attention_437 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6574488Z triton_flex_attention_440 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.6575598Z triton_flex_attention_438 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6575844Z SingleProcess AUTOTUNE benchmarking takes 0.2906 seconds and 1.4377 seconds precompiling for 6 choices 2025-12-04T09:14:58.6575910Z Autotune Choices Stats: 2025-12-04T09:14:58.6577373Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_445", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.6577833Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6578161Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6578717Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6579919Z triton_flex_attention_backward_445 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6581073Z triton_flex_attention_backward_446 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6582298Z triton_flex_attention_backward_443 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6583455Z triton_flex_attention_backward_444 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6584603Z triton_flex_attention_backward_447 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6585795Z triton_flex_attention_backward_448 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6586983Z triton_flex_attention_backward_450 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6588167Z triton_flex_attention_backward_449 0.0135 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6589325Z triton_flex_attention_backward_452 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6590501Z triton_flex_attention_backward_455 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.6590752Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.3278 seconds precompiling for 13 choices 2025-12-04T09:14:58.6590880Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.6590949Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.6591017Z unimplemented [] 2025-12-04T09:14:58.6591115Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.6591296Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.6592472Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.6592533Z graph_break [] 2025-12-04T09:14:58.6592664Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.6592726Z Autotune Choices Stats: 2025-12-04T09:14:58.6594159Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_460", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:58.6594444Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6594662Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6594973Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6596139Z triton_flex_attention_460 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6597272Z triton_flex_attention_461 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6598424Z triton_flex_attention_459 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.6599534Z triton_flex_attention_456 0.0114 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6602454Z triton_flex_attention_458 0.0123 ms 75.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.6603632Z triton_flex_attention_457 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6603908Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.4513 seconds precompiling for 6 choices 2025-12-04T09:14:58.6604072Z Autotune Choices Stats: 2025-12-04T09:14:58.6605523Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_463", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.6605969Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6606292Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6606843Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6607997Z triton_flex_attention_backward_463 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6609171Z triton_flex_attention_backward_465 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6610317Z triton_flex_attention_backward_462 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6611525Z triton_flex_attention_backward_464 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6612775Z triton_flex_attention_backward_467 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6613959Z triton_flex_attention_backward_468 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6615101Z triton_flex_attention_backward_469 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6616238Z triton_flex_attention_backward_466 0.0133 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6617405Z triton_flex_attention_backward_471 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6618546Z triton_flex_attention_backward_470 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6618793Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.2662 seconds precompiling for 13 choices 2025-12-04T09:14:58.6618935Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.6619010Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.6619073Z unimplemented [] 2025-12-04T09:14:58.6619184Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.6619373Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.6620604Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.6620666Z graph_break [] 2025-12-04T09:14:58.6620796Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.6620933Z Autotune Choices Stats: 2025-12-04T09:14:58.6622316Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_479", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.007968000136315823, "best_triton_pos": 0} 2025-12-04T09:14:58.6622563Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6622778Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6623095Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6624218Z triton_flex_attention_479 0.0080 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6625368Z triton_flex_attention_480 0.0082 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6626467Z triton_flex_attention_477 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.6627570Z triton_flex_attention_478 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.6628697Z triton_flex_attention_475 0.0113 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6629830Z triton_flex_attention_476 0.0133 ms 59.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6630122Z SingleProcess AUTOTUNE benchmarking takes 0.2917 seconds and 1.3266 seconds precompiling for 6 choices 2025-12-04T09:14:58.6630189Z Autotune Choices Stats: 2025-12-04T09:14:58.6631599Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_481", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.6632049Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6632374Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6632930Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6634123Z triton_flex_attention_backward_481 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6635290Z triton_flex_attention_backward_482 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6636468Z triton_flex_attention_backward_483 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6637608Z triton_flex_attention_backward_484 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6638819Z triton_flex_attention_backward_485 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6639960Z triton_flex_attention_backward_486 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6641115Z triton_flex_attention_backward_487 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6642259Z triton_flex_attention_backward_488 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6643428Z triton_flex_attention_backward_490 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6644568Z triton_flex_attention_backward_493 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.6644812Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2475 seconds precompiling for 13 choices 2025-12-04T09:14:58.6644982Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.6645050Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.6645111Z unimplemented [] 2025-12-04T09:14:58.6645219Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.6645405Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.6646623Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.6646719Z graph_break [] 2025-12-04T09:14:58.6646849Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.6646920Z Autotune Choices Stats: 2025-12-04T09:14:58.6648304Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_498", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.6648552Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6648768Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6649083Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6650196Z triton_flex_attention_498 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6651337Z triton_flex_attention_499 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6652473Z triton_flex_attention_494 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6653635Z triton_flex_attention_496 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.6654765Z triton_flex_attention_497 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.6655921Z triton_flex_attention_495 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6656161Z SingleProcess AUTOTUNE benchmarking takes 0.2916 seconds and 1.3152 seconds precompiling for 6 choices 2025-12-04T09:14:58.6656234Z Autotune Choices Stats: 2025-12-04T09:14:58.6657644Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_500", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.6658078Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6658433Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6658989Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6660143Z triton_flex_attention_backward_500 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6661290Z triton_flex_attention_backward_501 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6662705Z triton_flex_attention_backward_502 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6663919Z triton_flex_attention_backward_503 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6665107Z triton_flex_attention_backward_505 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6666244Z triton_flex_attention_backward_507 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6667397Z triton_flex_attention_backward_506 0.0143 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6668584Z triton_flex_attention_backward_504 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6669714Z triton_flex_attention_backward_509 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6670891Z triton_flex_attention_backward_512 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.6671133Z SingleProcess AUTOTUNE benchmarking takes 0.6641 seconds and 2.2733 seconds precompiling for 13 choices 2025-12-04T09:14:58.6671267Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.6671334Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.6671394Z unimplemented [] 2025-12-04T09:14:58.6671498Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.6671746Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.6672927Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 27), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 8), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 5), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.6672995Z graph_break [] 2025-12-04T09:14:58.6673125Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.6673195Z Autotune Choices Stats: 2025-12-04T09:14:58.6674569Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_517", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.6674814Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6675028Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6675387Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6676501Z triton_flex_attention_517 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6677599Z triton_flex_attention_518 0.0084 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6678732Z triton_flex_attention_513 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6679839Z triton_flex_attention_515 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.6681000Z triton_flex_attention_516 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.6682101Z triton_flex_attention_514 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6682359Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3311 seconds precompiling for 6 choices 2025-12-04T09:14:58.6682438Z Autotune Choices Stats: 2025-12-04T09:14:58.6683858Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_519", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.6684334Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6684655Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6685212Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6686366Z triton_flex_attention_backward_519 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6687554Z triton_flex_attention_backward_520 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6688749Z triton_flex_attention_backward_521 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6689920Z triton_flex_attention_backward_522 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6691064Z triton_flex_attention_backward_524 0.0132 ms 93.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6692205Z triton_flex_attention_backward_523 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6693468Z triton_flex_attention_backward_525 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6694621Z triton_flex_attention_backward_526 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6695804Z triton_flex_attention_backward_528 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6696951Z triton_flex_attention_backward_531 0.0153 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.6697270Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.2909 seconds precompiling for 13 choices 2025-12-04T09:14:58.6697411Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.6697482Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.6697545Z unimplemented [] 2025-12-04T09:14:58.6697657Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.6697843Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.6699027Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.6699095Z graph_break [] 2025-12-04T09:14:58.6699227Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.6699302Z Autotune Choices Stats: 2025-12-04T09:14:58.6700680Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_536", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.6700974Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6701201Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6701520Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6702644Z triton_flex_attention_536 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6703752Z triton_flex_attention_537 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6704888Z triton_flex_attention_532 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6706028Z triton_flex_attention_534 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.6707158Z triton_flex_attention_535 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.6708263Z triton_flex_attention_533 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6708508Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.2928 seconds precompiling for 6 choices 2025-12-04T09:14:58.6708582Z Autotune Choices Stats: 2025-12-04T09:14:58.6710005Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_538", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:14:58.6710476Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6710809Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6711370Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6712560Z triton_flex_attention_backward_538 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6713739Z triton_flex_attention_backward_539 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6714961Z triton_flex_attention_backward_540 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6716103Z triton_flex_attention_backward_541 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6717250Z triton_flex_attention_backward_542 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6718419Z triton_flex_attention_backward_543 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6719560Z triton_flex_attention_backward_545 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6720703Z triton_flex_attention_backward_544 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6721919Z triton_flex_attention_backward_547 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6723105Z triton_flex_attention_backward_550 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.6723385Z SingleProcess AUTOTUNE benchmarking takes 0.6681 seconds and 2.3456 seconds precompiling for 13 choices 2025-12-04T09:14:58.6723521Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.6723590Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.6723651Z unimplemented [] 2025-12-04T09:14:58.6723770Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.6723957Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.6725124Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.6725188Z graph_break [] 2025-12-04T09:14:58.6725312Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.6725382Z Autotune Choices Stats: 2025-12-04T09:14:58.6726773Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_555", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:58.6727068Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6727289Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6727602Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6728721Z triton_flex_attention_555 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6729861Z triton_flex_attention_556 0.0112 ms 91.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6731006Z triton_flex_attention_553 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.6732142Z triton_flex_attention_554 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.6733300Z triton_flex_attention_551 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6734411Z triton_flex_attention_552 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6734704Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.3230 seconds precompiling for 6 choices 2025-12-04T09:14:58.6734774Z Autotune Choices Stats: 2025-12-04T09:14:58.6736197Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_559", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:58.6736629Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6736964Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6737566Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6738714Z triton_flex_attention_backward_559 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6739895Z triton_flex_attention_backward_560 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6741062Z triton_flex_attention_backward_558 0.0142 ms 93.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6742204Z triton_flex_attention_backward_557 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6743350Z triton_flex_attention_backward_562 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6744522Z triton_flex_attention_backward_561 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6745658Z triton_flex_attention_backward_563 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6746850Z triton_flex_attention_backward_564 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6748022Z triton_flex_attention_backward_569 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.6749212Z triton_flex_attention_backward_566 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6749463Z SingleProcess AUTOTUNE benchmarking takes 0.6659 seconds and 2.2862 seconds precompiling for 13 choices 2025-12-04T09:14:58.6749592Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.6749669Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.6749732Z unimplemented [] 2025-12-04T09:14:58.6749847Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.6750035Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.6751205Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.6751328Z graph_break [] 2025-12-04T09:14:58.6751453Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.6751531Z Autotune Choices Stats: 2025-12-04T09:14:58.6752913Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_575", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:58.6753156Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6753370Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6753695Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6754845Z triton_flex_attention_575 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6755988Z triton_flex_attention_574 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6757125Z triton_flex_attention_572 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.6758230Z triton_flex_attention_573 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.6759338Z triton_flex_attention_570 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6760487Z triton_flex_attention_571 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6760728Z SingleProcess AUTOTUNE benchmarking takes 0.2930 seconds and 1.2848 seconds precompiling for 6 choices 2025-12-04T09:14:58.6760800Z Autotune Choices Stats: 2025-12-04T09:14:58.6762356Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_579", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.014271999709308147, "best_triton_pos": 0} 2025-12-04T09:14:58.6762798Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6763181Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6763745Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6764947Z triton_flex_attention_backward_579 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6766149Z triton_flex_attention_backward_577 0.0143 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6767290Z triton_flex_attention_backward_578 0.0143 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6768437Z triton_flex_attention_backward_576 0.0145 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6769639Z triton_flex_attention_backward_580 0.0154 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6770780Z triton_flex_attention_backward_581 0.0154 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6771956Z triton_flex_attention_backward_583 0.0155 ms 92.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6773181Z triton_flex_attention_backward_582 0.0164 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6774411Z triton_flex_attention_backward_585 0.0164 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6775574Z triton_flex_attention_backward_588 0.0164 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.6775822Z SingleProcess AUTOTUNE benchmarking takes 0.6683 seconds and 2.2637 seconds precompiling for 13 choices 2025-12-04T09:14:58.6775993Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:14:58.6776077Z Traceback (most recent call last): 2025-12-04T09:14:58.6776377Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:14:58.6776447Z self.assertTrue( 2025-12-04T09:14:58.6776643Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:14:58.6776758Z raise self.failureException(msg) 2025-12-04T09:14:58.6777005Z AssertionError: False is not true : Log file /tmp/tmpxa4mrgd0/flex_attention_configs.json was not created 2025-12-04T09:14:58.6777012Z 2025-12-04T09:14:58.6777156Z To execute this test, run the following from the base repo dir: 2025-12-04T09:14:58.6777410Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:14:58.6777420Z 2025-12-04T09:14:58.6777583Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:14:58.6777716Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.6777792Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.6777854Z unimplemented [] 2025-12-04T09:14:58.6777956Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.6779144Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('async_compile_cache_miss', 23), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:14:58.6779328Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.6779401Z graph_break [] 2025-12-04T09:14:58.6779530Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.6780558Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:14:58.6780641Z current_size = base.storage().size() 2025-12-04T09:14:58.6780708Z Autotune Choices Stats: 2025-12-04T09:14:58.6782137Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:58.6782432Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6782652Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6782968Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6784091Z triton_flex_attention_4 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6785185Z triton_flex_attention_5 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6786320Z triton_flex_attention_0 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6787414Z triton_flex_attention_2 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.6788546Z triton_flex_attention_3 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.6789649Z triton_flex_attention_1 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6789959Z SingleProcess AUTOTUNE benchmarking takes 0.2600 seconds and 1.4244 seconds precompiling for 6 choices 2025-12-04T09:14:58.6790024Z Autotune Choices Stats: 2025-12-04T09:14:58.6791440Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:58.6791874Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6792217Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6792770Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6793926Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6795108Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6796250Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6797420Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6798588Z triton_flex_attention_backward_11 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6799761Z triton_flex_attention_backward_12 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6800892Z triton_flex_attention_backward_13 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6802028Z triton_flex_attention_backward_10 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6803206Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6804342Z triton_flex_attention_backward_15 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6804594Z SingleProcess AUTOTUNE benchmarking takes 0.8717 seconds and 2.2315 seconds precompiling for 13 choices 2025-12-04T09:14:58.6804724Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.6804798Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.6804860Z unimplemented [] 2025-12-04T09:14:58.6804995Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.6805188Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.6806358Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.6806486Z graph_break [] 2025-12-04T09:14:58.6806617Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.6806682Z Autotune Choices Stats: 2025-12-04T09:14:58.6808062Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.6808304Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6808524Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6808835Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6809950Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6811080Z triton_flex_attention_24 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6812193Z triton_flex_attention_19 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6813390Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.6814504Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.6815658Z triton_flex_attention_20 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6815932Z SingleProcess AUTOTUNE benchmarking takes 0.2898 seconds and 1.2778 seconds precompiling for 6 choices 2025-12-04T09:14:58.6815998Z Autotune Choices Stats: 2025-12-04T09:14:58.6817419Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.6817855Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6818188Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6818785Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6819941Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6821079Z triton_flex_attention_backward_27 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6822252Z triton_flex_attention_backward_28 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6823423Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6824589Z triton_flex_attention_backward_30 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6825730Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6826873Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6828042Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6829194Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6830324Z triton_flex_attention_backward_37 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.6830605Z SingleProcess AUTOTUNE benchmarking takes 0.6605 seconds and 2.3199 seconds precompiling for 13 choices 2025-12-04T09:14:58.6830734Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.6830801Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.6830869Z unimplemented [] 2025-12-04T09:14:58.6830972Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.6831159Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.6832366Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.6832461Z graph_break [] 2025-12-04T09:14:58.6832588Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.6832655Z Autotune Choices Stats: 2025-12-04T09:14:58.6834039Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.6834282Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6834505Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6834816Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6835966Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6837075Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6838178Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.6839314Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.6840452Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6841588Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6841829Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2616 seconds precompiling for 6 choices 2025-12-04T09:14:58.6841894Z Autotune Choices Stats: 2025-12-04T09:14:58.6843311Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.6843740Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6844102Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6844655Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6845807Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6846989Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6848136Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6849384Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6850528Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6851671Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6852854Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6854031Z triton_flex_attention_backward_50 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6855170Z triton_flex_attention_backward_53 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6856336Z triton_flex_attention_backward_56 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.6856584Z SingleProcess AUTOTUNE benchmarking takes 0.6631 seconds and 2.2635 seconds precompiling for 13 choices 2025-12-04T09:14:58.6856722Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.6856793Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.6856895Z unimplemented [] 2025-12-04T09:14:58.6857026Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.6857211Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.6858375Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.6858434Z graph_break [] 2025-12-04T09:14:58.6858569Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.6858633Z Autotune Choices Stats: 2025-12-04T09:14:58.6860021Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.6860264Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6860517Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6860827Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6862079Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6863186Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6864360Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.6865531Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.6866703Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6867804Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6868057Z SingleProcess AUTOTUNE benchmarking takes 0.2907 seconds and 1.2500 seconds precompiling for 6 choices 2025-12-04T09:14:58.6868124Z Autotune Choices Stats: 2025-12-04T09:14:58.6869557Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.6870039Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6870371Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6870922Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6872080Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6873250Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6874420Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6875593Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6876726Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6877866Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6879066Z triton_flex_attention_backward_69 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6880201Z triton_flex_attention_backward_70 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6881380Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6882509Z triton_flex_attention_backward_75 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.6882820Z SingleProcess AUTOTUNE benchmarking takes 0.6601 seconds and 2.1469 seconds precompiling for 13 choices 2025-12-04T09:14:58.6882951Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.6883017Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.6883084Z unimplemented [] 2025-12-04T09:14:58.6883186Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.6883372Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.6884546Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.6884607Z graph_break [] 2025-12-04T09:14:58.6884736Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.6884801Z Autotune Choices Stats: 2025-12-04T09:14:58.6886180Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.6886455Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6886677Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6886984Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6888109Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6889247Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6890349Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.6891514Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.6892702Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6893806Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6894054Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2576 seconds precompiling for 6 choices 2025-12-04T09:14:58.6894118Z Autotune Choices Stats: 2025-12-04T09:14:58.6895731Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_84", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.6896162Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6896488Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6897045Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6898247Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6899435Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6900609Z triton_flex_attention_backward_83 0.0122 ms 92.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6901754Z triton_flex_attention_backward_82 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6902888Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6904058Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6905200Z triton_flex_attention_backward_88 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6906340Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6907510Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6908677Z triton_flex_attention_backward_91 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6908985Z SingleProcess AUTOTUNE benchmarking takes 0.6599 seconds and 2.2799 seconds precompiling for 13 choices 2025-12-04T09:14:58.6909115Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.6909187Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.6909253Z unimplemented [] 2025-12-04T09:14:58.6909353Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.6909536Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.6910715Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.6910774Z graph_break [] 2025-12-04T09:14:58.6910902Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.6910966Z Autotune Choices Stats: 2025-12-04T09:14:58.6912378Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:58.6912621Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6912839Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6913147Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6914254Z triton_flex_attention_99 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6915386Z triton_flex_attention_100 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6916530Z triton_flex_attention_95 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6917661Z triton_flex_attention_98 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.6918764Z triton_flex_attention_97 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.6919865Z triton_flex_attention_96 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6920143Z SingleProcess AUTOTUNE benchmarking takes 0.2905 seconds and 1.2994 seconds precompiling for 6 choices 2025-12-04T09:14:58.6920208Z Autotune Choices Stats: 2025-12-04T09:14:58.6921626Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_102", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.6922054Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6922381Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6922960Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6924114Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6925320Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6926461Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6927609Z triton_flex_attention_backward_101 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6928744Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6929918Z triton_flex_attention_backward_107 0.0133 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6931053Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6932223Z triton_flex_attention_backward_108 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6933477Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6934646Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6934894Z SingleProcess AUTOTUNE benchmarking takes 0.6615 seconds and 2.2434 seconds precompiling for 13 choices 2025-12-04T09:14:58.6935021Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.6935093Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.6935162Z unimplemented [] 2025-12-04T09:14:58.6935262Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.6935444Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.6936621Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.6936713Z graph_break [] 2025-12-04T09:14:58.6936845Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.6936912Z Autotune Choices Stats: 2025-12-04T09:14:58.6938292Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.6938532Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6938752Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6939070Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6940212Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6941367Z triton_flex_attention_119 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6942504Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.6943603Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.6944705Z triton_flex_attention_114 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6945835Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6946082Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2587 seconds precompiling for 6 choices 2025-12-04T09:14:58.6946149Z Autotune Choices Stats: 2025-12-04T09:14:58.6947575Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.6948011Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6948382Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6948932Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6950118Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6951287Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6952437Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6953588Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6954764Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6955902Z triton_flex_attention_backward_124 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6957072Z triton_flex_attention_backward_126 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6958245Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6959411Z triton_flex_attention_backward_129 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6960554Z triton_flex_attention_backward_132 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.6960804Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.3330 seconds precompiling for 13 choices 2025-12-04T09:14:58.6960933Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.6961002Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.6961068Z unimplemented [] 2025-12-04T09:14:58.6961168Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.6961361Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.6962733Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.6962799Z graph_break [] 2025-12-04T09:14:58.6962938Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.6963004Z Autotune Choices Stats: 2025-12-04T09:14:58.6964409Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:58.6964655Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6964938Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6965254Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6966384Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6967580Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6968702Z triton_flex_attention_136 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.6969830Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.6970938Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6972090Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6972386Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3488 seconds precompiling for 6 choices 2025-12-04T09:14:58.6972457Z Autotune Choices Stats: 2025-12-04T09:14:58.6973952Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_140", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.6974389Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6974719Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6975338Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6976519Z triton_flex_attention_backward_140 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6977664Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6978820Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6979989Z triton_flex_attention_backward_142 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6981130Z triton_flex_attention_backward_144 0.0131 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6982349Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.6983491Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6984671Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.6985843Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6986988Z triton_flex_attention_backward_149 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.6987250Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.2392 seconds precompiling for 13 choices 2025-12-04T09:14:58.6987413Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.6987482Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.6987548Z unimplemented [] 2025-12-04T09:14:58.6987649Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.6987836Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.6989017Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.6989078Z graph_break [] 2025-12-04T09:14:58.6989211Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.6989278Z Autotune Choices Stats: 2025-12-04T09:14:58.6990705Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.6990949Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.6991163Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.6991495Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.6992654Z triton_flex_attention_156 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6993796Z triton_flex_attention_157 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6994907Z triton_flex_attention_152 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6996024Z triton_flex_attention_154 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.6997167Z triton_flex_attention_155 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.6998269Z triton_flex_attention_153 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.6998525Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3309 seconds precompiling for 6 choices 2025-12-04T09:14:58.6998590Z Autotune Choices Stats: 2025-12-04T09:14:58.7000049Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_160", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.7000480Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7000874Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7001431Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7002597Z triton_flex_attention_backward_160 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7003744Z triton_flex_attention_backward_158 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7004941Z triton_flex_attention_backward_159 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7006089Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7007241Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7008425Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7009604Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7010787Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7011929Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7013111Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.7013390Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.1546 seconds precompiling for 13 choices 2025-12-04T09:14:58.7013526Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.7013594Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.7013669Z unimplemented [] 2025-12-04T09:14:58.7013777Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.7013964Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.7015145Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.7015207Z graph_break [] 2025-12-04T09:14:58.7015340Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.7015406Z Autotune Choices Stats: 2025-12-04T09:14:58.7016834Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.7017077Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7017354Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7017684Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7018810Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7019916Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7021025Z triton_flex_attention_171 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7022162Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.7023278Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.7024415Z triton_flex_attention_172 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7024664Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3199 seconds precompiling for 6 choices 2025-12-04T09:14:58.7024728Z Autotune Choices Stats: 2025-12-04T09:14:58.7026178Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_178", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.7026640Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7026966Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7027525Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7028702Z triton_flex_attention_backward_178 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7029863Z triton_flex_attention_backward_177 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7031039Z triton_flex_attention_backward_180 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7032185Z triton_flex_attention_backward_179 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7033364Z triton_flex_attention_backward_182 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7034539Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7035737Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7036879Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7038017Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7039205Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.7039449Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3449 seconds precompiling for 13 choices 2025-12-04T09:14:58.7039581Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.7039647Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.7039714Z unimplemented [] 2025-12-04T09:14:58.7039812Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.7039992Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.7041169Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.7041263Z graph_break [] 2025-12-04T09:14:58.7041394Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.7041459Z Autotune Choices Stats: 2025-12-04T09:14:58.7042874Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:58.7043153Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7043368Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7043685Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7044803Z triton_flex_attention_194 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7045916Z triton_flex_attention_195 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7047062Z triton_flex_attention_192 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.7048188Z triton_flex_attention_193 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.7049296Z triton_flex_attention_190 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7050436Z triton_flex_attention_191 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7050685Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3094 seconds precompiling for 6 choices 2025-12-04T09:14:58.7050752Z Autotune Choices Stats: 2025-12-04T09:14:58.7052218Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.7052723Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7053052Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7053609Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7054777Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7055960Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7057105Z triton_flex_attention_backward_197 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7058280Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7059425Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7060634Z triton_flex_attention_backward_200 0.0124 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7061880Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7063038Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7064182Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.7065396Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7065642Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.3930 seconds precompiling for 13 choices 2025-12-04T09:14:58.7065778Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.7065846Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.7065915Z unimplemented [] 2025-12-04T09:14:58.7066020Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.7066215Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.7067443Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.7067509Z graph_break [] 2025-12-04T09:14:58.7067642Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.7067718Z Autotune Choices Stats: 2025-12-04T09:14:58.7069166Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008063999935984612, "best_triton_pos": 0} 2025-12-04T09:14:58.7069462Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7069677Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7070001Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7071123Z triton_flex_attention_213 0.0081 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7072235Z triton_flex_attention_214 0.0082 ms 98.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7073377Z triton_flex_attention_211 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.7074486Z triton_flex_attention_212 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.7075628Z triton_flex_attention_209 0.0112 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7076764Z triton_flex_attention_210 0.0133 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7077061Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.3148 seconds precompiling for 6 choices 2025-12-04T09:14:58.7077126Z Autotune Choices Stats: 2025-12-04T09:14:58.7078548Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.7078992Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7079320Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7079879Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7081076Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7082219Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7083396Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7084576Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7085756Z triton_flex_attention_backward_219 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7086933Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7088074Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7089227Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7090404Z triton_flex_attention_backward_224 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7091549Z triton_flex_attention_backward_227 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.7091795Z SingleProcess AUTOTUNE benchmarking takes 0.6619 seconds and 2.3746 seconds precompiling for 13 choices 2025-12-04T09:14:58.7091926Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.7091992Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.7092090Z unimplemented [] 2025-12-04T09:14:58.7092191Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.7092406Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.7093631Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.7093725Z graph_break [] 2025-12-04T09:14:58.7093860Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.7093927Z Autotune Choices Stats: 2025-12-04T09:14:58.7095310Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.7095557Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7095771Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7096090Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7097199Z triton_flex_attention_232 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7098342Z triton_flex_attention_233 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7099446Z triton_flex_attention_228 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7100606Z triton_flex_attention_230 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.7101727Z triton_flex_attention_231 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.7102901Z triton_flex_attention_229 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7103150Z SingleProcess AUTOTUNE benchmarking takes 0.2900 seconds and 1.2906 seconds precompiling for 6 choices 2025-12-04T09:14:58.7103215Z Autotune Choices Stats: 2025-12-04T09:14:58.7104639Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.7105069Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7105395Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7105985Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7107150Z triton_flex_attention_backward_234 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7108293Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7109474Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7110656Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7111834Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7112982Z triton_flex_attention_backward_240 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7114121Z triton_flex_attention_backward_241 0.0134 ms 91.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7115357Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7116499Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7117680Z triton_flex_attention_backward_246 0.0153 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.7117924Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2768 seconds precompiling for 13 choices 2025-12-04T09:14:58.7118056Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.7118127Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.7118191Z unimplemented [] 2025-12-04T09:14:58.7118296Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.7118478Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.7119714Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.7119778Z graph_break [] 2025-12-04T09:14:58.7119910Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.7119978Z Autotune Choices Stats: 2025-12-04T09:14:58.7121356Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.7121604Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7121818Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7122174Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7123295Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7124408Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7125525Z triton_flex_attention_250 0.0111 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.7126671Z triton_flex_attention_247 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7127816Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.7128960Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7129210Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2952 seconds precompiling for 6 choices 2025-12-04T09:14:58.7129278Z Autotune Choices Stats: 2025-12-04T09:14:58.7130709Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_255", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.7131190Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7131521Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7132084Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7133306Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7134491Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7135672Z triton_flex_attention_backward_254 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7136847Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7137995Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7139144Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7140325Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7141474Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7142623Z triton_flex_attention_backward_262 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7143812Z triton_flex_attention_backward_265 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.7144057Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.3139 seconds precompiling for 13 choices 2025-12-04T09:14:58.7144257Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.7144329Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.7144392Z unimplemented [] 2025-12-04T09:14:58.7144498Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.7144681Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.7145856Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.7145915Z graph_break [] 2025-12-04T09:14:58.7146048Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.7146116Z Autotune Choices Stats: 2025-12-04T09:14:58.7147491Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.7147773Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7147988Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7148307Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7149433Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7150543Z triton_flex_attention_271 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7151691Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7152829Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.7153977Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.7155091Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7155342Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T09:14:58.7155409Z Autotune Choices Stats: 2025-12-04T09:14:58.7156826Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.7157293Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7157617Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7158196Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7159397Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7160551Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7161736Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7163098Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7164266Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7165408Z triton_flex_attention_backward_278 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7166625Z triton_flex_attention_backward_279 0.0136 ms 90.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7167770Z triton_flex_attention_backward_276 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7168968Z triton_flex_attention_backward_281 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7170161Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.7170456Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.2940 seconds precompiling for 13 choices 2025-12-04T09:14:58.7170596Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.7170669Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.7170749Z unimplemented [] 2025-12-04T09:14:58.7170864Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.7171050Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.7172231Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.7172291Z graph_break [] 2025-12-04T09:14:58.7172465Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.7172539Z Autotune Choices Stats: 2025-12-04T09:14:58.7173944Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.7174236Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7174453Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7174773Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7175905Z triton_flex_attention_289 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7177063Z triton_flex_attention_290 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7178207Z triton_flex_attention_287 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.7179358Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7180469Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.7181577Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7181872Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3003 seconds precompiling for 6 choices 2025-12-04T09:14:58.7181940Z Autotune Choices Stats: 2025-12-04T09:14:58.7183364Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_294", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.012160000391304493, "best_triton_pos": 0} 2025-12-04T09:14:58.7183800Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7184127Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7184694Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7185887Z triton_flex_attention_backward_294 0.0122 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7187070Z triton_flex_attention_backward_291 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7188251Z triton_flex_attention_backward_292 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7189396Z triton_flex_attention_backward_293 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7190546Z triton_flex_attention_backward_296 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7194963Z triton_flex_attention_backward_295 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7196194Z triton_flex_attention_backward_297 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7197447Z triton_flex_attention_backward_298 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7198610Z triton_flex_attention_backward_300 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7199863Z triton_flex_attention_backward_303 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.7200119Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.2523 seconds precompiling for 13 choices 2025-12-04T09:14:58.7200266Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.7200341Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.7200406Z unimplemented [] 2025-12-04T09:14:58.7200520Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.7200712Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.7201896Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.7201961Z graph_break [] 2025-12-04T09:14:58.7202133Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.7202212Z Autotune Choices Stats: 2025-12-04T09:14:58.7203610Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_308", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.7203867Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7204085Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7204410Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7205587Z triton_flex_attention_308 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7206714Z triton_flex_attention_309 0.0082 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7207896Z triton_flex_attention_307 0.0112 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.7209002Z triton_flex_attention_304 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7210111Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.7211219Z triton_flex_attention_305 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7211507Z SingleProcess AUTOTUNE benchmarking takes 0.2924 seconds and 1.4096 seconds precompiling for 6 choices 2025-12-04T09:14:58.7211578Z Autotune Choices Stats: 2025-12-04T09:14:58.7213081Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.7213523Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7213892Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7214453Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7215639Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7216838Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7217988Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7219145Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7220331Z triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7221475Z triton_flex_attention_backward_316 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7222623Z triton_flex_attention_backward_317 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7223801Z triton_flex_attention_backward_314 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7224971Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7226146Z triton_flex_attention_backward_318 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7226395Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.3225 seconds precompiling for 13 choices 2025-12-04T09:14:58.7226539Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.7226612Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.7226676Z unimplemented [] 2025-12-04T09:14:58.7226792Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.7226982Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.7228164Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.7228262Z graph_break [] 2025-12-04T09:14:58.7228391Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.7228465Z Autotune Choices Stats: 2025-12-04T09:14:58.7229854Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.007968000136315823, "best_triton_pos": 0} 2025-12-04T09:14:58.7230103Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7230327Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7230651Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7231830Z triton_flex_attention_327 0.0080 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7232979Z triton_flex_attention_328 0.0082 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7234121Z triton_flex_attention_325 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.7235242Z triton_flex_attention_326 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.7236347Z triton_flex_attention_323 0.0113 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7237500Z triton_flex_attention_324 0.0133 ms 59.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7237753Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T09:14:58.7237819Z Autotune Choices Stats: 2025-12-04T09:14:58.7239241Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.7239726Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7240056Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7240620Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7241847Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7243006Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7244167Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7245316Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7246500Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7247646Z triton_flex_attention_backward_336 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7248826Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7250003Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7251222Z triton_flex_attention_backward_338 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7252411Z triton_flex_attention_backward_341 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.7252674Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3668 seconds precompiling for 13 choices 2025-12-04T09:14:58.7252810Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.7252882Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.7252944Z unimplemented [] 2025-12-04T09:14:58.7253093Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.7253278Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.7254464Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.7254527Z graph_break [] 2025-12-04T09:14:58.7254656Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.7254730Z Autotune Choices Stats: 2025-12-04T09:14:58.7256109Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.7256392Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7256624Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7256953Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7258107Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7259249Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7260359Z triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7261475Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.7262817Z triton_flex_attention_345 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.7263937Z triton_flex_attention_343 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7264190Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2820 seconds precompiling for 6 choices 2025-12-04T09:14:58.7264272Z Autotune Choices Stats: 2025-12-04T09:14:58.7265769Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_349", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.7266216Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7266613Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7267233Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7268403Z triton_flex_attention_backward_349 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7269568Z triton_flex_attention_backward_348 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7270722Z triton_flex_attention_backward_350 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7271915Z triton_flex_attention_backward_351 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7273064Z triton_flex_attention_backward_353 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7274243Z triton_flex_attention_backward_355 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7275423Z triton_flex_attention_backward_352 0.0143 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7276592Z triton_flex_attention_backward_354 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7277744Z triton_flex_attention_backward_357 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7278898Z triton_flex_attention_backward_360 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.7279179Z SingleProcess AUTOTUNE benchmarking takes 0.6637 seconds and 2.2959 seconds precompiling for 13 choices 2025-12-04T09:14:58.7279316Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.7279390Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.7279453Z unimplemented [] 2025-12-04T09:14:58.7279582Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.7279773Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.7280956Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.7281018Z graph_break [] 2025-12-04T09:14:58.7281147Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.7281224Z Autotune Choices Stats: 2025-12-04T09:14:58.7282642Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.7282893Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7283114Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7283470Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7284625Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7285735Z triton_flex_attention_366 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7286844Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.7287949Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7289091Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.7290203Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7290449Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3270 seconds precompiling for 6 choices 2025-12-04T09:14:58.7290524Z Autotune Choices Stats: 2025-12-04T09:14:58.7291984Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_368", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.7292569Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7292902Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7293470Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7294630Z triton_flex_attention_backward_368 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7295779Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7296963Z triton_flex_attention_backward_369 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7298118Z triton_flex_attention_backward_370 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7299311Z triton_flex_attention_backward_372 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7300455Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7301670Z triton_flex_attention_backward_374 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7302811Z triton_flex_attention_backward_371 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7303961Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7305108Z triton_flex_attention_backward_379 0.0146 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.7305386Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.4637 seconds precompiling for 13 choices 2025-12-04T09:14:58.7305523Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.7305597Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.7305661Z unimplemented [] 2025-12-04T09:14:58.7305774Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.7305958Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.7307139Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.7307211Z graph_break [] 2025-12-04T09:14:58.7307341Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.7307417Z Autotune Choices Stats: 2025-12-04T09:14:58.7308837Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_384", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.7309151Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7309372Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7309694Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7310810Z triton_flex_attention_384 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7311930Z triton_flex_attention_385 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7313035Z triton_flex_attention_380 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7314185Z triton_flex_attention_382 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.7315288Z triton_flex_attention_383 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.7316443Z triton_flex_attention_381 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7316687Z SingleProcess AUTOTUNE benchmarking takes 0.2918 seconds and 1.3242 seconds precompiling for 6 choices 2025-12-04T09:14:58.7316764Z Autotune Choices Stats: 2025-12-04T09:14:58.7318225Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_386", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.7318701Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7319028Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7319598Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7320756Z triton_flex_attention_backward_386 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7321956Z triton_flex_attention_backward_387 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7323105Z triton_flex_attention_backward_388 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7324266Z triton_flex_attention_backward_389 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7325454Z triton_flex_attention_backward_391 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7326652Z triton_flex_attention_backward_392 0.0134 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7327839Z triton_flex_attention_backward_390 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7328991Z triton_flex_attention_backward_393 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7330141Z triton_flex_attention_backward_395 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7331323Z triton_flex_attention_backward_394 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7331569Z SingleProcess AUTOTUNE benchmarking takes 0.6645 seconds and 2.2524 seconds precompiling for 13 choices 2025-12-04T09:14:58.7331700Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.7331791Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.7331855Z unimplemented [] 2025-12-04T09:14:58.7331966Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.7332151Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.7333460Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.7333531Z graph_break [] 2025-12-04T09:14:58.7333658Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.7333734Z Autotune Choices Stats: 2025-12-04T09:14:58.7335150Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_403", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.7335430Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7335647Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7335963Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7337094Z triton_flex_attention_403 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7338231Z triton_flex_attention_404 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7339505Z triton_flex_attention_399 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7340622Z triton_flex_attention_401 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.7341771Z triton_flex_attention_402 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.7342885Z triton_flex_attention_400 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7343160Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3374 seconds precompiling for 6 choices 2025-12-04T09:14:58.7343269Z Autotune Choices Stats: 2025-12-04T09:14:58.7344685Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_406", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.7345123Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7345453Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7346020Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7347182Z triton_flex_attention_backward_406 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7348366Z triton_flex_attention_backward_405 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7349508Z triton_flex_attention_backward_407 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7350682Z triton_flex_attention_backward_408 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7351860Z triton_flex_attention_backward_410 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7353032Z triton_flex_attention_backward_409 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7354184Z triton_flex_attention_backward_411 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7355324Z triton_flex_attention_backward_412 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7356504Z triton_flex_attention_backward_413 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7357655Z triton_flex_attention_backward_414 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7357901Z SingleProcess AUTOTUNE benchmarking takes 0.6644 seconds and 2.4031 seconds precompiling for 13 choices 2025-12-04T09:14:58.7358045Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.7358125Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.7358187Z unimplemented [] 2025-12-04T09:14:58.7358296Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.7358531Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.7359713Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.7359783Z graph_break [] 2025-12-04T09:14:58.7359978Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.7360051Z Autotune Choices Stats: 2025-12-04T09:14:58.7361431Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_422", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.7361679Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7362080Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7362408Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7363552Z triton_flex_attention_422 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7364758Z triton_flex_attention_423 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7365882Z triton_flex_attention_420 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.7367008Z triton_flex_attention_421 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.7368164Z triton_flex_attention_418 0.0123 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7369335Z triton_flex_attention_419 0.0142 ms 57.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7369628Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3073 seconds precompiling for 6 choices 2025-12-04T09:14:58.7369703Z Autotune Choices Stats: 2025-12-04T09:14:58.7371127Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_426", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012256000190973282, "best_triton_pos": 0} 2025-12-04T09:14:58.7371572Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7371899Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7372581Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7373762Z triton_flex_attention_backward_426 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7374916Z triton_flex_attention_backward_424 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7376101Z triton_flex_attention_backward_425 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7377261Z triton_flex_attention_backward_427 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7378482Z triton_flex_attention_backward_429 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7379632Z triton_flex_attention_backward_431 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7380782Z triton_flex_attention_backward_430 0.0141 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7381926Z triton_flex_attention_backward_428 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7383113Z triton_flex_attention_backward_433 0.0143 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7384267Z triton_flex_attention_backward_436 0.0144 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.7384548Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.3392 seconds precompiling for 13 choices 2025-12-04T09:14:58.7384681Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.7384759Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.7384823Z unimplemented [] 2025-12-04T09:14:58.7384937Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.7385126Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.7386395Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.7386503Z graph_break [] 2025-12-04T09:14:58.7386637Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.7386715Z Autotune Choices Stats: 2025-12-04T09:14:58.7388113Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_441", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.7388370Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7388593Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7388911Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7390066Z triton_flex_attention_441 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7391234Z triton_flex_attention_442 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7392349Z triton_flex_attention_439 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.7393497Z triton_flex_attention_437 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7394639Z triton_flex_attention_440 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.7395794Z triton_flex_attention_438 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7396040Z SingleProcess AUTOTUNE benchmarking takes 0.2906 seconds and 1.4377 seconds precompiling for 6 choices 2025-12-04T09:14:58.7396116Z Autotune Choices Stats: 2025-12-04T09:14:58.7397538Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_445", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.7397978Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7398339Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7398910Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7400084Z triton_flex_attention_backward_445 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7401236Z triton_flex_attention_backward_446 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7402409Z triton_flex_attention_backward_443 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7403591Z triton_flex_attention_backward_444 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7404772Z triton_flex_attention_backward_447 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7405919Z triton_flex_attention_backward_448 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7407069Z triton_flex_attention_backward_450 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7408242Z triton_flex_attention_backward_449 0.0135 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7409400Z triton_flex_attention_backward_452 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7410583Z triton_flex_attention_backward_455 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.7410829Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.3278 seconds precompiling for 13 choices 2025-12-04T09:14:58.7410959Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.7411036Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.7411097Z unimplemented [] 2025-12-04T09:14:58.7411234Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.7411454Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.7412689Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.7412762Z graph_break [] 2025-12-04T09:14:58.7412890Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.7412962Z Autotune Choices Stats: 2025-12-04T09:14:58.7414342Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_460", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:58.7414586Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7414845Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7415161Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7416303Z triton_flex_attention_460 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7417412Z triton_flex_attention_461 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7418553Z triton_flex_attention_459 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.7419659Z triton_flex_attention_456 0.0114 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7420851Z triton_flex_attention_458 0.0123 ms 75.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.7421966Z triton_flex_attention_457 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7422212Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.4513 seconds precompiling for 6 choices 2025-12-04T09:14:58.7422287Z Autotune Choices Stats: 2025-12-04T09:14:58.7423709Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_463", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.7424183Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7424513Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7425082Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7426246Z triton_flex_attention_backward_463 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7427443Z triton_flex_attention_backward_465 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7428623Z triton_flex_attention_backward_462 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7429808Z triton_flex_attention_backward_464 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7430962Z triton_flex_attention_backward_467 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7432108Z triton_flex_attention_backward_468 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7433290Z triton_flex_attention_backward_469 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7434434Z triton_flex_attention_backward_466 0.0133 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7435613Z triton_flex_attention_backward_471 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7436755Z triton_flex_attention_backward_470 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7437083Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.2662 seconds precompiling for 13 choices 2025-12-04T09:14:58.7437218Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.7437298Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.7437363Z unimplemented [] 2025-12-04T09:14:58.7437467Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.7437657Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.7438829Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.7438901Z graph_break [] 2025-12-04T09:14:58.7439039Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.7439114Z Autotune Choices Stats: 2025-12-04T09:14:58.7440492Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_479", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.007968000136315823, "best_triton_pos": 0} 2025-12-04T09:14:58.7440783Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7441014Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7441331Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7442459Z triton_flex_attention_479 0.0080 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7443598Z triton_flex_attention_480 0.0082 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7444712Z triton_flex_attention_477 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.7445854Z triton_flex_attention_478 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.7446988Z triton_flex_attention_475 0.0113 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7448103Z triton_flex_attention_476 0.0133 ms 59.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7448347Z SingleProcess AUTOTUNE benchmarking takes 0.2917 seconds and 1.3266 seconds precompiling for 6 choices 2025-12-04T09:14:58.7448422Z Autotune Choices Stats: 2025-12-04T09:14:58.7449841Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_481", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.7450315Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7450641Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7451203Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7452455Z triton_flex_attention_backward_481 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7453672Z triton_flex_attention_backward_482 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7454854Z triton_flex_attention_backward_483 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7456012Z triton_flex_attention_backward_484 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7457161Z triton_flex_attention_backward_485 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7458338Z triton_flex_attention_backward_486 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7459489Z triton_flex_attention_backward_487 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7460630Z triton_flex_attention_backward_488 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7462003Z triton_flex_attention_backward_490 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7463216Z triton_flex_attention_backward_493 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.7463520Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2475 seconds precompiling for 13 choices 2025-12-04T09:14:58.7463650Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.7463725Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.7463790Z unimplemented [] 2025-12-04T09:14:58.7463905Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.7464093Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.7465270Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.7465338Z graph_break [] 2025-12-04T09:14:58.7465464Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.7465530Z Autotune Choices Stats: 2025-12-04T09:14:58.7466915Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_498", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.7467223Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7467441Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7467755Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7468886Z triton_flex_attention_498 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7470022Z triton_flex_attention_499 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7471158Z triton_flex_attention_494 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7472301Z triton_flex_attention_496 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.7473407Z triton_flex_attention_497 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.7474517Z triton_flex_attention_495 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7474793Z SingleProcess AUTOTUNE benchmarking takes 0.2916 seconds and 1.3152 seconds precompiling for 6 choices 2025-12-04T09:14:58.7474878Z Autotune Choices Stats: 2025-12-04T09:14:58.7476306Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_500", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.7476744Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7477072Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7477670Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7478829Z triton_flex_attention_backward_500 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7480056Z triton_flex_attention_backward_501 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7481198Z triton_flex_attention_backward_502 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7482349Z triton_flex_attention_backward_503 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7483492Z triton_flex_attention_backward_505 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7484677Z triton_flex_attention_backward_507 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7485827Z triton_flex_attention_backward_506 0.0143 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7487025Z triton_flex_attention_backward_504 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7488204Z triton_flex_attention_backward_509 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7489378Z triton_flex_attention_backward_512 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.7489630Z SingleProcess AUTOTUNE benchmarking takes 0.6641 seconds and 2.2733 seconds precompiling for 13 choices 2025-12-04T09:14:58.7489758Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.7489835Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.7489897Z unimplemented [] 2025-12-04T09:14:58.7489997Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.7490197Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.7491375Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 27), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 8), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 5), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.7491480Z graph_break [] 2025-12-04T09:14:58.7491607Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.7491677Z Autotune Choices Stats: 2025-12-04T09:14:58.7493120Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_517", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.7493364Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7493585Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7493904Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7495069Z triton_flex_attention_517 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7496203Z triton_flex_attention_518 0.0084 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7497349Z triton_flex_attention_513 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7498461Z triton_flex_attention_515 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.7499575Z triton_flex_attention_516 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.7500729Z triton_flex_attention_514 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7500971Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3311 seconds precompiling for 6 choices 2025-12-04T09:14:58.7501046Z Autotune Choices Stats: 2025-12-04T09:14:58.7502459Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_519", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.7502899Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7503253Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7503817Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7505012Z triton_flex_attention_backward_519 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7506198Z triton_flex_attention_backward_520 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7507359Z triton_flex_attention_backward_521 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7508520Z triton_flex_attention_backward_522 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7509709Z triton_flex_attention_backward_524 0.0132 ms 93.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7510858Z triton_flex_attention_backward_523 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7512048Z triton_flex_attention_backward_525 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7513215Z triton_flex_attention_backward_526 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7514400Z triton_flex_attention_backward_528 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7515542Z triton_flex_attention_backward_531 0.0153 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.7515795Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.2909 seconds precompiling for 13 choices 2025-12-04T09:14:58.7515925Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.7516003Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.7516064Z unimplemented [] 2025-12-04T09:14:58.7516166Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.7516354Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.7517583Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.7517651Z graph_break [] 2025-12-04T09:14:58.7517777Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.7517845Z Autotune Choices Stats: 2025-12-04T09:14:58.7519230Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_536", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.7519474Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7519693Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7520088Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7521212Z triton_flex_attention_536 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7522381Z triton_flex_attention_537 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7523489Z triton_flex_attention_532 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7524606Z triton_flex_attention_534 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.7525721Z triton_flex_attention_535 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.7526872Z triton_flex_attention_533 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7527118Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.2928 seconds precompiling for 6 choices 2025-12-04T09:14:58.7527194Z Autotune Choices Stats: 2025-12-04T09:14:58.7528640Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_538", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:14:58.7529077Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7529401Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7529997Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7531183Z triton_flex_attention_backward_538 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7532377Z triton_flex_attention_backward_539 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7533522Z triton_flex_attention_backward_540 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7534705Z triton_flex_attention_backward_541 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7535852Z triton_flex_attention_backward_542 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7537034Z triton_flex_attention_backward_543 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7538184Z triton_flex_attention_backward_545 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7539380Z triton_flex_attention_backward_544 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7540577Z triton_flex_attention_backward_547 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7541717Z triton_flex_attention_backward_550 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.7541975Z SingleProcess AUTOTUNE benchmarking takes 0.6681 seconds and 2.3456 seconds precompiling for 13 choices 2025-12-04T09:14:58.7542140Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.7542216Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.7542279Z unimplemented [] 2025-12-04T09:14:58.7542383Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.7542579Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.7543759Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.7543824Z graph_break [] 2025-12-04T09:14:58.7543950Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.7544019Z Autotune Choices Stats: 2025-12-04T09:14:58.7545448Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_555", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:58.7545693Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7545916Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7546235Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7547393Z triton_flex_attention_555 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7548543Z triton_flex_attention_556 0.0112 ms 91.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7549670Z triton_flex_attention_553 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.7550787Z triton_flex_attention_554 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.7551924Z triton_flex_attention_551 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7553043Z triton_flex_attention_552 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7553291Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.3230 seconds precompiling for 6 choices 2025-12-04T09:14:58.7553365Z Autotune Choices Stats: 2025-12-04T09:14:58.7554816Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_559", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:58.7555260Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7555648Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7556208Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7557378Z triton_flex_attention_backward_559 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7558537Z triton_flex_attention_backward_560 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7559691Z triton_flex_attention_backward_558 0.0142 ms 93.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7560870Z triton_flex_attention_backward_557 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7562198Z triton_flex_attention_backward_562 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7563433Z triton_flex_attention_backward_561 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7564629Z triton_flex_attention_backward_563 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7565828Z triton_flex_attention_backward_564 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7567001Z triton_flex_attention_backward_569 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.7568150Z triton_flex_attention_backward_566 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7568452Z SingleProcess AUTOTUNE benchmarking takes 0.6659 seconds and 2.2862 seconds precompiling for 13 choices 2025-12-04T09:14:58.7568588Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.7568668Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.7568731Z unimplemented [] 2025-12-04T09:14:58.7568836Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.7569050Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.7570231Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.7570303Z graph_break [] 2025-12-04T09:14:58.7570433Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.7570502Z Autotune Choices Stats: 2025-12-04T09:14:58.7571935Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_575", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:58.7572184Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7572495Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7572857Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7573991Z triton_flex_attention_575 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7575101Z triton_flex_attention_574 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7576213Z triton_flex_attention_572 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.7577362Z triton_flex_attention_573 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.7578472Z triton_flex_attention_570 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7579590Z triton_flex_attention_571 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7579871Z SingleProcess AUTOTUNE benchmarking takes 0.2930 seconds and 1.2848 seconds precompiling for 6 choices 2025-12-04T09:14:58.7579939Z Autotune Choices Stats: 2025-12-04T09:14:58.7581410Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_579", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.014271999709308147, "best_triton_pos": 0} 2025-12-04T09:14:58.7581889Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7582220Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7582779Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7583937Z triton_flex_attention_backward_579 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7585087Z triton_flex_attention_backward_577 0.0143 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7586269Z triton_flex_attention_backward_578 0.0143 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7587414Z triton_flex_attention_backward_576 0.0145 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7588595Z triton_flex_attention_backward_580 0.0154 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7589774Z triton_flex_attention_backward_581 0.0154 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7590948Z triton_flex_attention_backward_583 0.0155 ms 92.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7592093Z triton_flex_attention_backward_582 0.0164 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7593244Z triton_flex_attention_backward_585 0.0164 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7594440Z triton_flex_attention_backward_588 0.0164 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.7594708Z SingleProcess AUTOTUNE benchmarking takes 0.6683 seconds and 2.2637 seconds precompiling for 13 choices 2025-12-04T09:14:58.7594837Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.7594911Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.7594973Z unimplemented [] 2025-12-04T09:14:58.7595075Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.7595262Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.7596436Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.7596539Z graph_break [] 2025-12-04T09:14:58.7596665Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.7596731Z Autotune Choices Stats: 2025-12-04T09:14:58.7598148Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.7598425Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7598649Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7598962Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7600093Z triton_flex_attention_593 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7601202Z triton_flex_attention_594 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7602337Z triton_flex_attention_589 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7603444Z triton_flex_attention_591 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.7604562Z triton_flex_attention_592 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.7605707Z triton_flex_attention_590 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7605958Z SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.3283 seconds precompiling for 6 choices 2025-12-04T09:14:58.7606023Z Autotune Choices Stats: 2025-12-04T09:14:58.7607480Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_595", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:14:58.7607949Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7608272Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7608832Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7610004Z triton_flex_attention_backward_595 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7611208Z triton_flex_attention_backward_596 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7612407Z triton_flex_attention_backward_597 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7613613Z triton_flex_attention_backward_598 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7614765Z triton_flex_attention_backward_600 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7615980Z triton_flex_attention_backward_604 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7617130Z triton_flex_attention_backward_599 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7618272Z triton_flex_attention_backward_601 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7619424Z triton_flex_attention_backward_602 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7620596Z triton_flex_attention_backward_607 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.7620847Z SingleProcess AUTOTUNE benchmarking takes 0.6683 seconds and 2.3815 seconds precompiling for 13 choices 2025-12-04T09:14:58.7621019Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:14:58.7621103Z Traceback (most recent call last): 2025-12-04T09:14:58.7621423Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:14:58.7621491Z self.assertTrue( 2025-12-04T09:14:58.7621696Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:14:58.7621779Z raise self.failureException(msg) 2025-12-04T09:14:58.7622055Z AssertionError: False is not true : Log file /tmp/tmptpkybqre/flex_attention_configs.json was not created 2025-12-04T09:14:58.7622061Z 2025-12-04T09:14:58.7622205Z To execute this test, run the following from the base repo dir: 2025-12-04T09:14:58.7622462Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:14:58.7622467Z 2025-12-04T09:14:58.7622638Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:14:58.7622770Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.7622877Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.7622949Z unimplemented [] 2025-12-04T09:14:58.7623087Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.7624287Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('async_compile_cache_miss', 23), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:14:58.7624475Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.7624536Z graph_break [] 2025-12-04T09:14:58.7624671Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.7625662Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:14:58.7625753Z current_size = base.storage().size() 2025-12-04T09:14:58.7625822Z Autotune Choices Stats: 2025-12-04T09:14:58.7627214Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:58.7627497Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7627717Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7628045Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7629166Z triton_flex_attention_4 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7630304Z triton_flex_attention_5 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7631442Z triton_flex_attention_0 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7632582Z triton_flex_attention_2 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.7633702Z triton_flex_attention_3 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.7634811Z triton_flex_attention_1 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7635069Z SingleProcess AUTOTUNE benchmarking takes 0.2600 seconds and 1.4244 seconds precompiling for 6 choices 2025-12-04T09:14:58.7635173Z Autotune Choices Stats: 2025-12-04T09:14:58.7636595Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:58.7637033Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7637368Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7637931Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7639129Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7640312Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7641486Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7642621Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7643768Z triton_flex_attention_backward_11 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7644966Z triton_flex_attention_backward_12 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7646107Z triton_flex_attention_backward_13 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7647293Z triton_flex_attention_backward_10 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7648447Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7649654Z triton_flex_attention_backward_15 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7649905Z SingleProcess AUTOTUNE benchmarking takes 0.8717 seconds and 2.2315 seconds precompiling for 13 choices 2025-12-04T09:14:58.7650040Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.7650112Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.7650180Z unimplemented [] 2025-12-04T09:14:58.7650284Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.7650473Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.7651663Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.7651724Z graph_break [] 2025-12-04T09:14:58.7651858Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.7652011Z Autotune Choices Stats: 2025-12-04T09:14:58.7653438Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.7653681Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7653898Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7654221Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7655377Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7656487Z triton_flex_attention_24 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7657637Z triton_flex_attention_19 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7658782Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.7659896Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.7661004Z triton_flex_attention_20 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7661291Z SingleProcess AUTOTUNE benchmarking takes 0.2898 seconds and 1.2778 seconds precompiling for 6 choices 2025-12-04T09:14:58.7661360Z Autotune Choices Stats: 2025-12-04T09:14:58.7662909Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.7663346Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7663681Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7664310Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7665520Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7666716Z triton_flex_attention_backward_27 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7667870Z triton_flex_attention_backward_28 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7669019Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7670213Z triton_flex_attention_backward_30 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7671358Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7672500Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7673690Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7674866Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7676070Z triton_flex_attention_backward_37 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.7676319Z SingleProcess AUTOTUNE benchmarking takes 0.6605 seconds and 2.3199 seconds precompiling for 13 choices 2025-12-04T09:14:58.7676457Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.7676528Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.7676596Z unimplemented [] 2025-12-04T09:14:58.7676698Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.7676887Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.7678066Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.7678166Z graph_break [] 2025-12-04T09:14:58.7678302Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.7678368Z Autotune Choices Stats: 2025-12-04T09:14:58.7679756Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.7680007Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7680226Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7680549Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7681711Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7682870Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7684014Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.7685133Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.7686242Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7687375Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7687630Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2616 seconds precompiling for 6 choices 2025-12-04T09:14:58.7687697Z Autotune Choices Stats: 2025-12-04T09:14:58.7689118Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.7689585Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7689917Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7690474Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7691673Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7692906Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7694070Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7695216Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7696400Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7697543Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7698729Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7699919Z triton_flex_attention_backward_50 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7701091Z triton_flex_attention_backward_53 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7702239Z triton_flex_attention_backward_56 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.7702487Z SingleProcess AUTOTUNE benchmarking takes 0.6631 seconds and 2.2635 seconds precompiling for 13 choices 2025-12-04T09:14:58.7702619Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.7702688Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.7702756Z unimplemented [] 2025-12-04T09:14:58.7702858Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.7703082Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.7704271Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.7704331Z graph_break [] 2025-12-04T09:14:58.7704464Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.7704531Z Autotune Choices Stats: 2025-12-04T09:14:58.7705907Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.7706156Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7706406Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7706737Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7707909Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7709062Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7710169Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.7711283Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.7712420Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7713527Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7713773Z SingleProcess AUTOTUNE benchmarking takes 0.2907 seconds and 1.2500 seconds precompiling for 6 choices 2025-12-04T09:14:58.7713840Z Autotune Choices Stats: 2025-12-04T09:14:58.7715294Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.7715728Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7716061Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7716692Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7717854Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7719009Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7720162Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7721343Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7722492Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7723668Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7724808Z triton_flex_attention_backward_69 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7726026Z triton_flex_attention_backward_70 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7727174Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7728336Z triton_flex_attention_backward_75 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.7728581Z SingleProcess AUTOTUNE benchmarking takes 0.6601 seconds and 2.1469 seconds precompiling for 13 choices 2025-12-04T09:14:58.7728748Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.7728824Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.7728894Z unimplemented [] 2025-12-04T09:14:58.7728998Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.7729180Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.7730366Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.7730440Z graph_break [] 2025-12-04T09:14:58.7730576Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.7730647Z Autotune Choices Stats: 2025-12-04T09:14:58.7732062Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.7732351Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7732571Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7732901Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7734094Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7735205Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7736317Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.7737425Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.7738565Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7739675Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7739930Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2576 seconds precompiling for 6 choices 2025-12-04T09:14:58.7739997Z Autotune Choices Stats: 2025-12-04T09:14:58.7741464Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_84", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.7741930Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7742294Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7742855Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7744038Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7745189Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7746365Z triton_flex_attention_backward_83 0.0122 ms 92.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7747502Z triton_flex_attention_backward_82 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7748645Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7749825Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7751008Z triton_flex_attention_backward_88 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7752185Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7753325Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7754487Z triton_flex_attention_backward_91 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7754762Z SingleProcess AUTOTUNE benchmarking takes 0.6599 seconds and 2.2799 seconds precompiling for 13 choices 2025-12-04T09:14:58.7754897Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.7754967Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.7755029Z unimplemented [] 2025-12-04T09:14:58.7755135Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.7755320Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.7756500Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.7756562Z graph_break [] 2025-12-04T09:14:58.7756693Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.7756758Z Autotune Choices Stats: 2025-12-04T09:14:58.7758186Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:58.7758446Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7758729Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7759065Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7760203Z triton_flex_attention_99 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7761319Z triton_flex_attention_100 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7762541Z triton_flex_attention_95 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7763724Z triton_flex_attention_98 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.7764837Z triton_flex_attention_97 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.7765995Z triton_flex_attention_96 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7766247Z SingleProcess AUTOTUNE benchmarking takes 0.2905 seconds and 1.2994 seconds precompiling for 6 choices 2025-12-04T09:14:58.7766314Z Autotune Choices Stats: 2025-12-04T09:14:58.7767781Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_102", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.7768262Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7768590Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7769161Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7770323Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7771477Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7772741Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7773891Z triton_flex_attention_backward_101 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7775081Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7776258Z triton_flex_attention_backward_107 0.0133 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7777439Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7778585Z triton_flex_attention_backward_108 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7779736Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7781298Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7781554Z SingleProcess AUTOTUNE benchmarking takes 0.6615 seconds and 2.2434 seconds precompiling for 13 choices 2025-12-04T09:14:58.7781690Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.7781760Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.7781824Z unimplemented [] 2025-12-04T09:14:58.7781934Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.7782117Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.7783350Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.7783413Z graph_break [] 2025-12-04T09:14:58.7783539Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.7783610Z Autotune Choices Stats: 2025-12-04T09:14:58.7785023Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.7785348Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7785566Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7785888Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7787013Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7788125Z triton_flex_attention_119 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7789275Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.7790386Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.7791499Z triton_flex_attention_114 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7792644Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7792894Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2587 seconds precompiling for 6 choices 2025-12-04T09:14:58.7792962Z Autotune Choices Stats: 2025-12-04T09:14:58.7794455Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.7794892Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7795218Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7795783Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7796938Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7798122Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7799268Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7800449Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7801629Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7802820Z triton_flex_attention_backward_124 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7803967Z triton_flex_attention_backward_126 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7805112Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7806284Z triton_flex_attention_backward_129 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7807431Z triton_flex_attention_backward_132 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.7807677Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.3330 seconds precompiling for 13 choices 2025-12-04T09:14:58.7807808Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.7807881Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.7807946Z unimplemented [] 2025-12-04T09:14:58.7808059Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.7808247Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.7809470Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.7809533Z graph_break [] 2025-12-04T09:14:58.7809657Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.7809726Z Autotune Choices Stats: 2025-12-04T09:14:58.7811180Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:58.7811429Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7811650Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7811978Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7813148Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7814264Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7815432Z triton_flex_attention_136 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.7816543Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.7817680Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7818832Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7819110Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3488 seconds precompiling for 6 choices 2025-12-04T09:14:58.7819177Z Autotune Choices Stats: 2025-12-04T09:14:58.7820604Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_140", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.7821049Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7821374Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7821941Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7823138Z triton_flex_attention_backward_140 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7824300Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7825486Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7826627Z triton_flex_attention_backward_142 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7827812Z triton_flex_attention_backward_144 0.0131 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7828994Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7830148Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7831301Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7832481Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7833625Z triton_flex_attention_backward_149 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7833871Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.2392 seconds precompiling for 13 choices 2025-12-04T09:14:58.7834016Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.7834120Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.7834185Z unimplemented [] 2025-12-04T09:14:58.7834291Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.7834472Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.7835697Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.7835790Z graph_break [] 2025-12-04T09:14:58.7835922Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.7835993Z Autotune Choices Stats: 2025-12-04T09:14:58.7837371Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.7837621Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7837838Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7838166Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7839292Z triton_flex_attention_156 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7840437Z triton_flex_attention_157 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7841539Z triton_flex_attention_152 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7842692Z triton_flex_attention_154 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.7843796Z triton_flex_attention_155 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.7844983Z triton_flex_attention_153 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7845229Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3309 seconds precompiling for 6 choices 2025-12-04T09:14:58.7845297Z Autotune Choices Stats: 2025-12-04T09:14:58.7846721Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_160", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.7847160Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7847485Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7848083Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7849239Z triton_flex_attention_backward_160 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7850392Z triton_flex_attention_backward_158 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7851584Z triton_flex_attention_backward_159 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7852808Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7853994Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7855136Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7856288Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7857465Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7858607Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7859788Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.7860033Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.1546 seconds precompiling for 13 choices 2025-12-04T09:14:58.7860170Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.7860241Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.7860304Z unimplemented [] 2025-12-04T09:14:58.7860412Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.7860597Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.7861980Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.7862050Z graph_break [] 2025-12-04T09:14:58.7862196Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.7862275Z Autotune Choices Stats: 2025-12-04T09:14:58.7863675Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.7863928Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7864145Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7864513Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7865637Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7866739Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7867897Z triton_flex_attention_171 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7869019Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.7870156Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.7871311Z triton_flex_attention_172 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7871563Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3199 seconds precompiling for 6 choices 2025-12-04T09:14:58.7871642Z Autotune Choices Stats: 2025-12-04T09:14:58.7873064Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_178", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.7873536Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7873863Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7874425Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7875578Z triton_flex_attention_backward_178 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7876760Z triton_flex_attention_backward_177 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7877939Z triton_flex_attention_backward_180 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7879115Z triton_flex_attention_backward_179 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7880259Z triton_flex_attention_backward_182 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7881406Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7882601Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7883758Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7884902Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7886077Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.7886327Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3449 seconds precompiling for 13 choices 2025-12-04T09:14:58.7886528Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.7886598Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.7886660Z unimplemented [] 2025-12-04T09:14:58.7886770Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.7886954Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.7888128Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.7888190Z graph_break [] 2025-12-04T09:14:58.7888320Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.7888391Z Autotune Choices Stats: 2025-12-04T09:14:58.7889778Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:58.7890060Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7890280Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7890599Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7891718Z triton_flex_attention_194 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7892881Z triton_flex_attention_195 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7894031Z triton_flex_attention_192 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.7895187Z triton_flex_attention_193 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.7896321Z triton_flex_attention_190 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7897428Z triton_flex_attention_191 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7897672Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3094 seconds precompiling for 6 choices 2025-12-04T09:14:58.7897751Z Autotune Choices Stats: 2025-12-04T09:14:58.7899168Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.7899657Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7899984Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7900558Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7901747Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7902902Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7904122Z triton_flex_attention_backward_197 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7905268Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7906420Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7907570Z triton_flex_attention_backward_200 0.0124 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7908760Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7909910Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7911095Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.7912272Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7912550Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.3930 seconds precompiling for 13 choices 2025-12-04T09:14:58.7912684Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.7912753Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.7912814Z unimplemented [] 2025-12-04T09:14:58.7912932Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.7913118Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.7914286Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.7914357Z graph_break [] 2025-12-04T09:14:58.7914485Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.7914559Z Autotune Choices Stats: 2025-12-04T09:14:58.7915949Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008063999935984612, "best_triton_pos": 0} 2025-12-04T09:14:58.7916233Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7916455Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7916774Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7917898Z triton_flex_attention_213 0.0081 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7919045Z triton_flex_attention_214 0.0082 ms 98.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7920235Z triton_flex_attention_211 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.7921380Z triton_flex_attention_212 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.7922484Z triton_flex_attention_209 0.0112 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7923594Z triton_flex_attention_210 0.0133 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7923869Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.3148 seconds precompiling for 6 choices 2025-12-04T09:14:58.7923943Z Autotune Choices Stats: 2025-12-04T09:14:58.7925357Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.7925794Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7926120Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7926688Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7927877Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7929059Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7930252Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7931399Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7932579Z triton_flex_attention_backward_219 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7933764Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7934918Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7936088Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7937265Z triton_flex_attention_backward_224 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7938441Z triton_flex_attention_backward_227 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.7938689Z SingleProcess AUTOTUNE benchmarking takes 0.6619 seconds and 2.3746 seconds precompiling for 13 choices 2025-12-04T09:14:58.7938826Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.7938896Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.7938960Z unimplemented [] 2025-12-04T09:14:58.7939070Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.7939256Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.7940425Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.7940527Z graph_break [] 2025-12-04T09:14:58.7940657Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.7940730Z Autotune Choices Stats: 2025-12-04T09:14:58.7942115Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.7942361Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7942578Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7942896Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7944056Z triton_flex_attention_232 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7945171Z triton_flex_attention_233 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7946343Z triton_flex_attention_228 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7947468Z triton_flex_attention_230 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.7948581Z triton_flex_attention_231 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.7949700Z triton_flex_attention_229 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7949979Z SingleProcess AUTOTUNE benchmarking takes 0.2900 seconds and 1.2906 seconds precompiling for 6 choices 2025-12-04T09:14:58.7950052Z Autotune Choices Stats: 2025-12-04T09:14:58.7951474Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.7951919Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7952281Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7952842Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7954031Z triton_flex_attention_backward_234 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7955217Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7956363Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7957508Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7958687Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7959837Z triton_flex_attention_backward_240 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7961010Z triton_flex_attention_backward_241 0.0134 ms 91.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7962299Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7963596Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7964756Z triton_flex_attention_backward_246 0.0153 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.7965003Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2768 seconds precompiling for 13 choices 2025-12-04T09:14:58.7965138Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.7965207Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.7965268Z unimplemented [] 2025-12-04T09:14:58.7965375Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.7965559Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.7966729Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.7966845Z graph_break [] 2025-12-04T09:14:58.7966973Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.7967043Z Autotune Choices Stats: 2025-12-04T09:14:58.7968418Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.7968668Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7968886Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7969253Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7970378Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7971525Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7972745Z triton_flex_attention_250 0.0111 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.7973864Z triton_flex_attention_247 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7974973Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.7976123Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7976367Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2952 seconds precompiling for 6 choices 2025-12-04T09:14:58.7976441Z Autotune Choices Stats: 2025-12-04T09:14:58.7977861Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_255", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.7978331Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7978654Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7979261Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7980453Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7981604Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7982752Z triton_flex_attention_backward_254 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7983934Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7985082Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7986231Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.7987417Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.7988594Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.7989783Z triton_flex_attention_backward_262 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7990940Z triton_flex_attention_backward_265 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.7991196Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.3139 seconds precompiling for 13 choices 2025-12-04T09:14:58.7991327Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.7991399Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.7991497Z unimplemented [] 2025-12-04T09:14:58.7991606Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.7991791Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.7992966Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.7993036Z graph_break [] 2025-12-04T09:14:58.7993163Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.7993240Z Autotune Choices Stats: 2025-12-04T09:14:58.7994628Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.7994928Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.7995148Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.7995463Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.7996625Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7997773Z triton_flex_attention_271 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7998877Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.7999998Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.8001140Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.8002259Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8002503Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T09:14:58.8002580Z Autotune Choices Stats: 2025-12-04T09:14:58.8004029Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.8004467Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.8004824Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.8005422Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.8006576Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8007726Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8008875Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.8010065Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.8011216Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8012432Z triton_flex_attention_backward_278 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.8013620Z triton_flex_attention_backward_279 0.0136 ms 90.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.8014796Z triton_flex_attention_backward_276 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8015956Z triton_flex_attention_backward_281 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8017105Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.8017386Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.2940 seconds precompiling for 13 choices 2025-12-04T09:14:58.8017513Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.8017590Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.8017652Z unimplemented [] 2025-12-04T09:14:58.8017757Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.8017940Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.8019119Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.8019185Z graph_break [] 2025-12-04T09:14:58.8019310Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.8019395Z Autotune Choices Stats: 2025-12-04T09:14:58.8020807Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.8021062Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.8021280Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.8021660Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.8022790Z triton_flex_attention_289 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8023904Z triton_flex_attention_290 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8025010Z triton_flex_attention_287 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.8026175Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8027280Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.8028397Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8028639Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3003 seconds precompiling for 6 choices 2025-12-04T09:14:58.8028743Z Autotune Choices Stats: 2025-12-04T09:14:58.8030197Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_294", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.012160000391304493, "best_triton_pos": 0} 2025-12-04T09:14:58.8030664Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.8030988Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.8031552Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.8032713Z triton_flex_attention_backward_294 0.0122 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.8033866Z triton_flex_attention_backward_291 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8035042Z triton_flex_attention_backward_292 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8036184Z triton_flex_attention_backward_293 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.8037364Z triton_flex_attention_backward_296 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8038503Z triton_flex_attention_backward_295 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8039728Z triton_flex_attention_backward_297 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.8040871Z triton_flex_attention_backward_298 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.8042025Z triton_flex_attention_backward_300 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8043172Z triton_flex_attention_backward_303 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.8043446Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.2523 seconds precompiling for 13 choices 2025-12-04T09:14:58.8043573Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.8043661Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.8043723Z unimplemented [] 2025-12-04T09:14:58.8043825Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.8044012Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.8045184Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.8045250Z graph_break [] 2025-12-04T09:14:58.8045375Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.8045480Z Autotune Choices Stats: 2025-12-04T09:14:58.8046864Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_308", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.8047176Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.8047393Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.8047702Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.8048824Z triton_flex_attention_308 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8049944Z triton_flex_attention_309 0.0082 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8051058Z triton_flex_attention_307 0.0112 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.8052207Z triton_flex_attention_304 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8053355Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.8054551Z triton_flex_attention_305 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8054797Z SingleProcess AUTOTUNE benchmarking takes 0.2924 seconds and 1.4096 seconds precompiling for 6 choices 2025-12-04T09:14:58.8054873Z Autotune Choices Stats: 2025-12-04T09:14:58.8056322Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.8056806Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.8057130Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.8057699Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.8058858Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8060056Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8061207Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.8062475Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.8063687Z triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8064887Z triton_flex_attention_backward_316 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.8066092Z triton_flex_attention_backward_317 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.8067238Z triton_flex_attention_backward_314 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8068386Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8069597Z triton_flex_attention_backward_318 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8069847Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.3225 seconds precompiling for 13 choices 2025-12-04T09:14:58.8069981Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.8070067Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.8070132Z unimplemented [] 2025-12-04T09:14:58.8070238Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.8070436Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.8071645Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.8071711Z graph_break [] 2025-12-04T09:14:58.8071840Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.8071913Z Autotune Choices Stats: 2025-12-04T09:14:58.8073330Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.007968000136315823, "best_triton_pos": 0} 2025-12-04T09:14:58.8073608Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.8073823Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.8074136Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.8075266Z triton_flex_attention_327 0.0080 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8076373Z triton_flex_attention_328 0.0082 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8077516Z triton_flex_attention_325 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.8078627Z triton_flex_attention_326 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.8079773Z triton_flex_attention_323 0.0113 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8080879Z triton_flex_attention_324 0.0133 ms 59.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8081198Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T09:14:58.8081271Z Autotune Choices Stats: 2025-12-04T09:14:58.8082687Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.8083122Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.8083448Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.8084013Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.8085174Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8086379Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8087523Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.8088700Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.8089900Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8091069Z triton_flex_attention_backward_336 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.8092217Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8093422Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.8094608Z triton_flex_attention_backward_338 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8095748Z triton_flex_attention_backward_341 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.8096007Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3668 seconds precompiling for 13 choices 2025-12-04T09:14:58.8096138Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.8096214Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.8096278Z unimplemented [] 2025-12-04T09:14:58.8096380Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.8096605Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.8097789Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.8097889Z graph_break [] 2025-12-04T09:14:58.8098046Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.8098114Z Autotune Choices Stats: 2025-12-04T09:14:58.8099496Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.8099743Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.8099961Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.8100274Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.8101404Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8102539Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8103650Z triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8104767Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.8105904Z triton_flex_attention_345 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.8107049Z triton_flex_attention_343 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8107320Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2820 seconds precompiling for 6 choices 2025-12-04T09:14:58.8107392Z Autotune Choices Stats: 2025-12-04T09:14:58.8108819Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_349", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.8109265Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.8109592Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.8110189Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.8111350Z triton_flex_attention_backward_349 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8112510Z triton_flex_attention_backward_348 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8113685Z triton_flex_attention_backward_350 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.8114864Z triton_flex_attention_backward_351 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.8116034Z triton_flex_attention_backward_353 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8117175Z triton_flex_attention_backward_355 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.8118327Z triton_flex_attention_backward_352 0.0143 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8119504Z triton_flex_attention_backward_354 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.8120654Z triton_flex_attention_backward_357 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8121796Z triton_flex_attention_backward_360 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.8122112Z SingleProcess AUTOTUNE benchmarking takes 0.6637 seconds and 2.2959 seconds precompiling for 13 choices 2025-12-04T09:14:58.8122242Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.8122318Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.8122382Z unimplemented [] 2025-12-04T09:14:58.8122485Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.8122677Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.8123880Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.8123978Z graph_break [] 2025-12-04T09:14:58.8124121Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.8124189Z Autotune Choices Stats: 2025-12-04T09:14:58.8125578Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.8125819Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.8126044Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.8126358Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.8127520Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8128631Z triton_flex_attention_366 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8129750Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.8130888Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8132024Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.8133214Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8133458Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3270 seconds precompiling for 6 choices 2025-12-04T09:14:58.8133526Z Autotune Choices Stats: 2025-12-04T09:14:58.8134944Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_368", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.8135380Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.8135744Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.8136308Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.8137466Z triton_flex_attention_backward_368 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8138645Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8139800Z triton_flex_attention_backward_369 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.8140985Z triton_flex_attention_backward_370 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.8142159Z triton_flex_attention_backward_372 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8143318Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.8144464Z triton_flex_attention_backward_374 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.8145641Z triton_flex_attention_backward_371 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8146784Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8147955Z triton_flex_attention_backward_379 0.0146 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.8148209Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.4637 seconds precompiling for 13 choices 2025-12-04T09:14:58.8148334Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.8148405Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.8148501Z unimplemented [] 2025-12-04T09:14:58.8148634Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.8148824Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.8150005Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.8150071Z graph_break [] 2025-12-04T09:14:58.8150197Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.8150264Z Autotune Choices Stats: 2025-12-04T09:14:58.8151655Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_384", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.8151894Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.8152171Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.8152493Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.8153622Z triton_flex_attention_384 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8154733Z triton_flex_attention_385 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8155881Z triton_flex_attention_380 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8157024Z triton_flex_attention_382 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.8158161Z triton_flex_attention_383 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.8159269Z triton_flex_attention_381 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8159512Z SingleProcess AUTOTUNE benchmarking takes 0.2918 seconds and 1.3242 seconds precompiling for 6 choices 2025-12-04T09:14:58.8159581Z Autotune Choices Stats: 2025-12-04T09:14:58.8160998Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_386", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.8161464Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.8161904Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.8162470Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.8163635Z triton_flex_attention_backward_386 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8164841Z triton_flex_attention_backward_387 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8166035Z triton_flex_attention_backward_388 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.8167237Z triton_flex_attention_backward_389 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.8168383Z triton_flex_attention_backward_391 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8169532Z triton_flex_attention_backward_392 0.0134 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.8170727Z triton_flex_attention_backward_390 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8171865Z triton_flex_attention_backward_393 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.8173101Z triton_flex_attention_backward_395 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8174246Z triton_flex_attention_backward_394 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8174564Z SingleProcess AUTOTUNE benchmarking takes 0.6645 seconds and 2.2524 seconds precompiling for 13 choices 2025-12-04T09:14:58.8174694Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.8174770Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.8174838Z unimplemented [] 2025-12-04T09:14:58.8174941Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.8175131Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.8176305Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.8176375Z graph_break [] 2025-12-04T09:14:58.8176503Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.8176572Z Autotune Choices Stats: 2025-12-04T09:14:58.8177971Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_403", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.8178255Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.8181835Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.8182197Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.8183359Z triton_flex_attention_403 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8184548Z triton_flex_attention_404 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8185684Z triton_flex_attention_399 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8186901Z triton_flex_attention_401 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.8188016Z triton_flex_attention_402 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.8189137Z triton_flex_attention_400 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8189394Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3374 seconds precompiling for 6 choices 2025-12-04T09:14:58.8189466Z Autotune Choices Stats: 2025-12-04T09:14:58.8191005Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_406", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.8191455Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.8191785Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.8192349Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.8193569Z triton_flex_attention_backward_406 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8194777Z triton_flex_attention_backward_405 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8195972Z triton_flex_attention_backward_407 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.8197127Z triton_flex_attention_backward_408 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.8198268Z triton_flex_attention_backward_410 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8199459Z triton_flex_attention_backward_409 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8200611Z triton_flex_attention_backward_411 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.8201750Z triton_flex_attention_backward_412 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.8202928Z triton_flex_attention_backward_413 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8204112Z triton_flex_attention_backward_414 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8204402Z SingleProcess AUTOTUNE benchmarking takes 0.6644 seconds and 2.4031 seconds precompiling for 13 choices 2025-12-04T09:14:58.8204539Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.8204619Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.8204682Z unimplemented [] 2025-12-04T09:14:58.8204788Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.8204980Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.8206161Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.8206240Z graph_break [] 2025-12-04T09:14:58.8206375Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.8206440Z Autotune Choices Stats: 2025-12-04T09:14:58.8207886Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_422", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.8208133Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.8208355Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.8208672Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.8209806Z triton_flex_attention_422 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8210949Z triton_flex_attention_423 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8212103Z triton_flex_attention_420 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.8213335Z triton_flex_attention_421 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.8214458Z triton_flex_attention_418 0.0123 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8215575Z triton_flex_attention_419 0.0142 ms 57.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8215857Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3073 seconds precompiling for 6 choices 2025-12-04T09:14:58.8215927Z Autotune Choices Stats: 2025-12-04T09:14:58.8217361Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_426", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012256000190973282, "best_triton_pos": 0} 2025-12-04T09:14:58.8217798Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.8218123Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.8218716Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.8219884Z triton_flex_attention_backward_426 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.8221115Z triton_flex_attention_backward_424 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8222260Z triton_flex_attention_backward_425 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8223412Z triton_flex_attention_backward_427 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.8224554Z triton_flex_attention_backward_429 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8225938Z triton_flex_attention_backward_431 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.8227091Z triton_flex_attention_backward_430 0.0141 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.8228265Z triton_flex_attention_backward_428 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8229457Z triton_flex_attention_backward_433 0.0143 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8230629Z triton_flex_attention_backward_436 0.0144 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.8230886Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.3392 seconds precompiling for 13 choices 2025-12-04T09:14:58.8231016Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.8231093Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.8231155Z unimplemented [] 2025-12-04T09:14:58.8231259Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.8231450Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.8232630Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.8232728Z graph_break [] 2025-12-04T09:14:58.8232853Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.8232931Z Autotune Choices Stats: 2025-12-04T09:14:58.8234323Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_441", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.8234566Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.8234789Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.8235105Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.8236274Z triton_flex_attention_441 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8237417Z triton_flex_attention_442 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8238570Z triton_flex_attention_439 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.8239694Z triton_flex_attention_437 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8240832Z triton_flex_attention_440 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.8241988Z triton_flex_attention_438 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8242237Z SingleProcess AUTOTUNE benchmarking takes 0.2906 seconds and 1.4377 seconds precompiling for 6 choices 2025-12-04T09:14:58.8242315Z Autotune Choices Stats: 2025-12-04T09:14:58.8243753Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_445", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.8244189Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.8244566Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.8245126Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.8246336Z triton_flex_attention_backward_445 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.8247528Z triton_flex_attention_backward_446 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.8248679Z triton_flex_attention_backward_443 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8249836Z triton_flex_attention_backward_444 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8251015Z triton_flex_attention_backward_447 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8252168Z triton_flex_attention_backward_448 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8253412Z triton_flex_attention_backward_450 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.8254595Z triton_flex_attention_backward_449 0.0135 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.8255785Z triton_flex_attention_backward_452 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8256939Z triton_flex_attention_backward_455 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.8257193Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.3278 seconds precompiling for 13 choices 2025-12-04T09:14:58.8257326Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.8257403Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.8257466Z unimplemented [] 2025-12-04T09:14:58.8257570Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.8257764Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.8258970Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.8259037Z graph_break [] 2025-12-04T09:14:58.8259171Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.8259238Z Autotune Choices Stats: 2025-12-04T09:14:58.8260637Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_460", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:58.8260882Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.8261140Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.8261455Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.8262804Z triton_flex_attention_460 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8263989Z triton_flex_attention_461 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8265115Z triton_flex_attention_459 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.8266227Z triton_flex_attention_456 0.0114 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8267342Z triton_flex_attention_458 0.0123 ms 75.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.8268512Z triton_flex_attention_457 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8268759Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.4513 seconds precompiling for 6 choices 2025-12-04T09:14:58.8268826Z Autotune Choices Stats: 2025-12-04T09:14:58.8270311Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_463", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.8270746Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.8271081Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.8271702Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.8272876Z triton_flex_attention_backward_463 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8274036Z triton_flex_attention_backward_465 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.8275194Z triton_flex_attention_backward_462 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8276385Z triton_flex_attention_backward_464 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.8277529Z triton_flex_attention_backward_467 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8278713Z triton_flex_attention_backward_468 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.8279867Z triton_flex_attention_backward_469 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.8281092Z triton_flex_attention_backward_466 0.0133 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8282239Z triton_flex_attention_backward_471 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8283385Z triton_flex_attention_backward_470 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8283637Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.2662 seconds precompiling for 13 choices 2025-12-04T09:14:58.8283819Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.8283887Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.8283954Z unimplemented [] 2025-12-04T09:14:58.8284058Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.8284253Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.8285429Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.8285496Z graph_break [] 2025-12-04T09:14:58.8285623Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.8285689Z Autotune Choices Stats: 2025-12-04T09:14:58.8287131Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_479", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.007968000136315823, "best_triton_pos": 0} 2025-12-04T09:14:58.8287378Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.8287602Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.8287920Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.8289106Z triton_flex_attention_479 0.0080 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8290249Z triton_flex_attention_480 0.0082 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8291379Z triton_flex_attention_477 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.8292533Z triton_flex_attention_478 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.8293696Z triton_flex_attention_475 0.0113 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8294820Z triton_flex_attention_476 0.0133 ms 59.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8295068Z SingleProcess AUTOTUNE benchmarking takes 0.2917 seconds and 1.3266 seconds precompiling for 6 choices 2025-12-04T09:14:58.8295134Z Autotune Choices Stats: 2025-12-04T09:14:58.8296606Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_481", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.8297084Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.8297452Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.8298011Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.8299189Z triton_flex_attention_backward_481 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8300348Z triton_flex_attention_backward_482 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8301541Z triton_flex_attention_backward_483 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.8302702Z triton_flex_attention_backward_484 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.8303854Z triton_flex_attention_backward_485 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8305052Z triton_flex_attention_backward_486 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8306234Z triton_flex_attention_backward_487 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.8307419Z triton_flex_attention_backward_488 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.8308571Z triton_flex_attention_backward_490 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8309719Z triton_flex_attention_backward_493 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.8310000Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2475 seconds precompiling for 13 choices 2025-12-04T09:14:58.8310133Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.8310203Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.8310273Z unimplemented [] 2025-12-04T09:14:58.8310377Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.8310566Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.8311743Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.8311805Z graph_break [] 2025-12-04T09:14:58.8311939Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.8312006Z Autotune Choices Stats: 2025-12-04T09:14:58.8313433Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_498", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.8313677Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.8313991Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.8314304Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.8315440Z triton_flex_attention_498 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8316554Z triton_flex_attention_499 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8317673Z triton_flex_attention_494 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8318821Z triton_flex_attention_496 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.8319947Z triton_flex_attention_497 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.8321093Z triton_flex_attention_495 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8321348Z SingleProcess AUTOTUNE benchmarking takes 0.2916 seconds and 1.3152 seconds precompiling for 6 choices 2025-12-04T09:14:58.8321417Z Autotune Choices Stats: 2025-12-04T09:14:58.8322881Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_500", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.8323402Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.8323735Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.8324296Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.8325464Z triton_flex_attention_backward_500 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8326618Z triton_flex_attention_backward_501 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8327828Z triton_flex_attention_backward_502 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.8328990Z triton_flex_attention_backward_503 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.8330180Z triton_flex_attention_backward_505 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8331367Z triton_flex_attention_backward_507 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.8332614Z triton_flex_attention_backward_506 0.0143 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.8333776Z triton_flex_attention_backward_504 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8334934Z triton_flex_attention_backward_509 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8336142Z triton_flex_attention_backward_512 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.8336407Z SingleProcess AUTOTUNE benchmarking takes 0.6641 seconds and 2.2733 seconds precompiling for 13 choices 2025-12-04T09:14:58.8336542Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.8336613Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.8336682Z unimplemented [] 2025-12-04T09:14:58.8336792Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.8336985Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.8338196Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 27), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 8), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 5), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.8338259Z graph_break [] 2025-12-04T09:14:58.8338395Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.8338460Z Autotune Choices Stats: 2025-12-04T09:14:58.8339915Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_517", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.8340192Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.8340422Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.8340739Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.8341876Z triton_flex_attention_517 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8343003Z triton_flex_attention_518 0.0084 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8344148Z triton_flex_attention_513 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8345268Z triton_flex_attention_515 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.8346390Z triton_flex_attention_516 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.8347537Z triton_flex_attention_514 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8347789Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3311 seconds precompiling for 6 choices 2025-12-04T09:14:58.8347856Z Autotune Choices Stats: 2025-12-04T09:14:58.8349371Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_519", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.8349816Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.8350151Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.8350711Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.8351877Z triton_flex_attention_backward_519 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8353064Z triton_flex_attention_backward_520 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8354220Z triton_flex_attention_backward_521 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.8355417Z triton_flex_attention_backward_522 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.8356569Z triton_flex_attention_backward_524 0.0132 ms 93.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8357786Z triton_flex_attention_backward_523 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8358938Z triton_flex_attention_backward_525 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.8360093Z triton_flex_attention_backward_526 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.8361251Z triton_flex_attention_backward_528 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8362629Z triton_flex_attention_backward_531 0.0153 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.8362900Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.2909 seconds precompiling for 13 choices 2025-12-04T09:14:58.8363037Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.8363110Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.8363186Z unimplemented [] 2025-12-04T09:14:58.8363293Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.8363478Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.8364733Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.8364801Z graph_break [] 2025-12-04T09:14:58.8364949Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.8365021Z Autotune Choices Stats: 2025-12-04T09:14:58.8366523Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_536", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.8366780Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.8367008Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.8367329Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.8368470Z triton_flex_attention_536 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8369594Z triton_flex_attention_537 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8370768Z triton_flex_attention_532 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8371885Z triton_flex_attention_534 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.8373107Z triton_flex_attention_535 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.8374254Z triton_flex_attention_533 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8374541Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.2928 seconds precompiling for 6 choices 2025-12-04T09:14:58.8374607Z Autotune Choices Stats: 2025-12-04T09:14:58.8376034Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_538", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:14:58.8376467Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.8376808Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.8377375Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.8378583Z triton_flex_attention_backward_538 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8379752Z triton_flex_attention_backward_539 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8380922Z triton_flex_attention_backward_540 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.8382129Z triton_flex_attention_backward_541 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.8383315Z triton_flex_attention_backward_542 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8384494Z triton_flex_attention_backward_543 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8385646Z triton_flex_attention_backward_545 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.8386819Z triton_flex_attention_backward_544 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.8388009Z triton_flex_attention_backward_547 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8389158Z triton_flex_attention_backward_550 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.8389410Z SingleProcess AUTOTUNE benchmarking takes 0.6681 seconds and 2.3456 seconds precompiling for 13 choices 2025-12-04T09:14:58.8389538Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.8389639Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.8389716Z unimplemented [] 2025-12-04T09:14:58.8389824Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.8390009Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.8391224Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.8391327Z graph_break [] 2025-12-04T09:14:58.8391461Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.8391527Z Autotune Choices Stats: 2025-12-04T09:14:58.8392922Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_555", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:58.8393166Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.8393388Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.8393703Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.8394836Z triton_flex_attention_555 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8395988Z triton_flex_attention_556 0.0112 ms 91.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8397100Z triton_flex_attention_553 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.8398243Z triton_flex_attention_554 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.8399353Z triton_flex_attention_551 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8400531Z triton_flex_attention_552 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8400776Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.3230 seconds precompiling for 6 choices 2025-12-04T09:14:58.8400842Z Autotune Choices Stats: 2025-12-04T09:14:58.8402277Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_559", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:58.8402710Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.8403036Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.8403627Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.8404795Z triton_flex_attention_backward_559 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.8405952Z triton_flex_attention_backward_560 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.8407140Z triton_flex_attention_backward_558 0.0142 ms 93.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8408330Z triton_flex_attention_backward_557 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8409513Z triton_flex_attention_backward_562 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8410660Z triton_flex_attention_backward_561 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8411808Z triton_flex_attention_backward_563 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.8413063Z triton_flex_attention_backward_564 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.8414210Z triton_flex_attention_backward_569 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.8415399Z triton_flex_attention_backward_566 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8415652Z SingleProcess AUTOTUNE benchmarking takes 0.6659 seconds and 2.2862 seconds precompiling for 13 choices 2025-12-04T09:14:58.8415779Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.8415848Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.8415915Z unimplemented [] 2025-12-04T09:14:58.8416014Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.8416195Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.8417457Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.8417519Z graph_break [] 2025-12-04T09:14:58.8417653Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.8417719Z Autotune Choices Stats: 2025-12-04T09:14:58.8419114Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_575", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:58.8419357Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.8419581Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.8419933Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.8421066Z triton_flex_attention_575 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8422190Z triton_flex_attention_574 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8423342Z triton_flex_attention_572 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.8424454Z triton_flex_attention_573 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.8425612Z triton_flex_attention_570 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8426754Z triton_flex_attention_571 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8427004Z SingleProcess AUTOTUNE benchmarking takes 0.2930 seconds and 1.2848 seconds precompiling for 6 choices 2025-12-04T09:14:58.8427076Z Autotune Choices Stats: 2025-12-04T09:14:58.8428505Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_579", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.014271999709308147, "best_triton_pos": 0} 2025-12-04T09:14:58.8428965Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.8429302Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.8429860Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.8431029Z triton_flex_attention_backward_579 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.8432217Z triton_flex_attention_backward_577 0.0143 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8433407Z triton_flex_attention_backward_578 0.0143 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.8434589Z triton_flex_attention_backward_576 0.0145 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8435739Z triton_flex_attention_backward_580 0.0154 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8436887Z triton_flex_attention_backward_581 0.0154 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8438068Z triton_flex_attention_backward_583 0.0155 ms 92.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.8439220Z triton_flex_attention_backward_582 0.0164 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.8440379Z triton_flex_attention_backward_585 0.0164 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8441564Z triton_flex_attention_backward_588 0.0164 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.8441814Z SingleProcess AUTOTUNE benchmarking takes 0.6683 seconds and 2.2637 seconds precompiling for 13 choices 2025-12-04T09:14:58.8442026Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.8442096Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.8442166Z unimplemented [] 2025-12-04T09:14:58.8442265Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.8442446Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.8443621Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.8443682Z graph_break [] 2025-12-04T09:14:58.8443811Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.8443876Z Autotune Choices Stats: 2025-12-04T09:14:58.8445272Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.8445546Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.8445771Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.8446087Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.8447213Z triton_flex_attention_593 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8448331Z triton_flex_attention_594 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8449495Z triton_flex_attention_589 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8450639Z triton_flex_attention_591 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.8451791Z triton_flex_attention_592 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.8452942Z triton_flex_attention_590 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8453205Z SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.3283 seconds precompiling for 6 choices 2025-12-04T09:14:58.8453270Z Autotune Choices Stats: 2025-12-04T09:14:58.8454695Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_595", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:14:58.8455166Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.8455493Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.8456065Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.8457272Z triton_flex_attention_backward_595 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8458425Z triton_flex_attention_backward_596 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8459700Z triton_flex_attention_backward_597 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.8460851Z triton_flex_attention_backward_598 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.8462122Z triton_flex_attention_backward_600 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8463290Z triton_flex_attention_backward_604 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8464513Z triton_flex_attention_backward_599 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8465662Z triton_flex_attention_backward_601 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.8466859Z triton_flex_attention_backward_602 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.8468063Z triton_flex_attention_backward_607 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.8468365Z SingleProcess AUTOTUNE benchmarking takes 0.6683 seconds and 2.3815 seconds precompiling for 13 choices 2025-12-04T09:14:58.8468495Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.8468562Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.8468625Z unimplemented [] 2025-12-04T09:14:58.8468723Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.8468906Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.8470086Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.8470146Z graph_break [] 2025-12-04T09:14:58.8470278Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.8470343Z Autotune Choices Stats: 2025-12-04T09:14:58.8471733Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_612", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.8472009Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.8472223Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.8472546Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.8473679Z triton_flex_attention_612 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8474826Z triton_flex_attention_613 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8475995Z triton_flex_attention_610 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.8477142Z triton_flex_attention_608 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8478262Z triton_flex_attention_611 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.8479378Z triton_flex_attention_609 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8479653Z SingleProcess AUTOTUNE benchmarking takes 0.2948 seconds and 1.3336 seconds precompiling for 6 choices 2025-12-04T09:14:58.8479717Z Autotune Choices Stats: 2025-12-04T09:14:58.8481143Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_614", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:14:58.8481570Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.8481897Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.8482458Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.8483657Z triton_flex_attention_backward_614 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8484842Z triton_flex_attention_backward_615 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8486031Z triton_flex_attention_backward_616 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.8487187Z triton_flex_attention_backward_617 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.8488343Z triton_flex_attention_backward_619 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8489543Z triton_flex_attention_backward_620 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.8490691Z triton_flex_attention_backward_621 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.8491890Z triton_flex_attention_backward_618 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8493125Z triton_flex_attention_backward_623 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8494313Z triton_flex_attention_backward_626 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.8494556Z SingleProcess AUTOTUNE benchmarking takes 0.6671 seconds and 2.3094 seconds precompiling for 13 choices 2025-12-04T09:14:58.8494738Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:14:58.8494815Z Traceback (most recent call last): 2025-12-04T09:14:58.8495124Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:14:58.8495193Z self.assertTrue( 2025-12-04T09:14:58.8495389Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:14:58.8495478Z raise self.failureException(msg) 2025-12-04T09:14:58.8495726Z AssertionError: False is not true : Log file /tmp/tmp_4pivx94/flex_attention_configs.json was not created 2025-12-04T09:14:58.8495731Z 2025-12-04T09:14:58.8495873Z To execute this test, run the following from the base repo dir: 2025-12-04T09:14:58.8496127Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:14:58.8496170Z 2025-12-04T09:14:58.8496347Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:14:58.8496489Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.8496559Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.8496625Z unimplemented [] 2025-12-04T09:14:58.8496736Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.8497924Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('async_compile_cache_miss', 23), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:14:58.8498113Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.8498174Z graph_break [] 2025-12-04T09:14:58.8498311Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.8499295Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:14:58.8499380Z current_size = base.storage().size() 2025-12-04T09:14:58.8499491Z Autotune Choices Stats: 2025-12-04T09:14:58.8500889Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:58.8501204Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.8501420Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.8501747Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.8502878Z triton_flex_attention_4 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8503995Z triton_flex_attention_5 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8505100Z triton_flex_attention_0 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8506251Z triton_flex_attention_2 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.8507358Z triton_flex_attention_3 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.8508524Z triton_flex_attention_1 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8508774Z SingleProcess AUTOTUNE benchmarking takes 0.2600 seconds and 1.4244 seconds precompiling for 6 choices 2025-12-04T09:14:58.8508840Z Autotune Choices Stats: 2025-12-04T09:14:58.8510294Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:58.8510766Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.8511086Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.8511652Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.8512822Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8514008Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.8515154Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8516296Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.8517476Z triton_flex_attention_backward_11 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8518659Z triton_flex_attention_backward_12 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.8519841Z triton_flex_attention_backward_13 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.8520994Z triton_flex_attention_backward_10 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8522145Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8523328Z triton_flex_attention_backward_15 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8523572Z SingleProcess AUTOTUNE benchmarking takes 0.8717 seconds and 2.2315 seconds precompiling for 13 choices 2025-12-04T09:14:58.8523720Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.8523791Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.8523852Z unimplemented [] 2025-12-04T09:14:58.8523962Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.8524150Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.8525366Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.8525426Z graph_break [] 2025-12-04T09:14:58.8525551Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.8525624Z Autotune Choices Stats: 2025-12-04T09:14:58.8527039Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.8527317Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.8527530Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.8527853Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.8528985Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8530097Z triton_flex_attention_24 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8531242Z triton_flex_attention_19 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8532398Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.8533561Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.8534677Z triton_flex_attention_20 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8534922Z SingleProcess AUTOTUNE benchmarking takes 0.2898 seconds and 1.2778 seconds precompiling for 6 choices 2025-12-04T09:14:58.8535058Z Autotune Choices Stats: 2025-12-04T09:14:58.8536491Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.8536930Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.8537255Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.8537816Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.8538981Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8540191Z triton_flex_attention_backward_27 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.8541343Z triton_flex_attention_backward_28 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.8542530Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8543711Z triton_flex_attention_backward_30 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8544881Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8546030Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.8547172Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.8548351Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8549505Z triton_flex_attention_backward_37 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.8549752Z SingleProcess AUTOTUNE benchmarking takes 0.6605 seconds and 2.3199 seconds precompiling for 13 choices 2025-12-04T09:14:58.8549890Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.8549958Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.8550019Z unimplemented [] 2025-12-04T09:14:58.8550124Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.8550339Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.8551513Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.8551581Z graph_break [] 2025-12-04T09:14:58.8551709Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.8551844Z Autotune Choices Stats: 2025-12-04T09:14:58.8553226Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.8553476Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.8553690Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.8554013Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.8555140Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8556296Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8557416Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.8558535Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.8559680Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8560827Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8561100Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2616 seconds precompiling for 6 choices 2025-12-04T09:14:58.8561172Z Autotune Choices Stats: 2025-12-04T09:14:58.8562791Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.8563243Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.8563572Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.8564136Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.8565380Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8566564Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.8567760Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.8568912Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8570180Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8571325Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8572573Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.8573720Z triton_flex_attention_backward_50 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.8574910Z triton_flex_attention_backward_53 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8576057Z triton_flex_attention_backward_56 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.8576300Z SingleProcess AUTOTUNE benchmarking takes 0.6631 seconds and 2.2635 seconds precompiling for 13 choices 2025-12-04T09:14:58.8576819Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.8577108Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.8577299Z unimplemented [] 2025-12-04T09:14:58.8577506Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.8577877Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.8579365Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.8580728Z graph_break [] 2025-12-04T09:14:58.8580960Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.8581240Z Autotune Choices Stats: 2025-12-04T09:14:58.8582744Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.8584454Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.8584990Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.8585600Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.8587119Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8589467Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8591757Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.8594086Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.8596405Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8598723Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8600156Z SingleProcess AUTOTUNE benchmarking takes 0.2907 seconds and 1.2500 seconds precompiling for 6 choices 2025-12-04T09:14:58.8600546Z Autotune Choices Stats: 2025-12-04T09:14:58.8602089Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.8604329Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.8605229Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.8606206Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.8608009Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8610393Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8612945Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.8615374Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.8617777Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8620156Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8622525Z triton_flex_attention_backward_69 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.8624921Z triton_flex_attention_backward_70 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.8627286Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8629684Z triton_flex_attention_backward_75 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.8631148Z SingleProcess AUTOTUNE benchmarking takes 0.6601 seconds and 2.1469 seconds precompiling for 13 choices 2025-12-04T09:14:58.8631614Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.8631896Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.8632083Z unimplemented [] 2025-12-04T09:14:58.8632295Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.8632749Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.8634215Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.8635512Z graph_break [] 2025-12-04T09:14:58.8635735Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.8636010Z Autotune Choices Stats: 2025-12-04T09:14:58.8637512Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.8639205Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.8639748Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.8640395Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.8641912Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8644209Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8646541Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.8648832Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.8651199Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8653593Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8655020Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2576 seconds precompiling for 6 choices 2025-12-04T09:14:58.8655401Z Autotune Choices Stats: 2025-12-04T09:14:58.8656941Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_84", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.8658909Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.8659740Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.8660708Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.8662678Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.8665164Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.8667588Z triton_flex_attention_backward_83 0.0122 ms 92.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8670008Z triton_flex_attention_backward_82 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8672374Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8674731Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8677170Z triton_flex_attention_backward_88 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.8679542Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.8681940Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8684308Z triton_flex_attention_backward_91 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8685811Z SingleProcess AUTOTUNE benchmarking takes 0.6599 seconds and 2.2799 seconds precompiling for 13 choices 2025-12-04T09:14:58.8686300Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.8686583Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.8686774Z unimplemented [] 2025-12-04T09:14:58.8686973Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.8687355Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.8688788Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.8690089Z graph_break [] 2025-12-04T09:14:58.8690311Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.8690583Z Autotune Choices Stats: 2025-12-04T09:14:58.8692087Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:58.8693884Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.8694430Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.8695039Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.8696557Z triton_flex_attention_99 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8698884Z triton_flex_attention_100 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8701235Z triton_flex_attention_95 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8703569Z triton_flex_attention_98 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.8705894Z triton_flex_attention_97 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.8708183Z triton_flex_attention_96 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8709602Z SingleProcess AUTOTUNE benchmarking takes 0.2905 seconds and 1.2994 seconds precompiling for 6 choices 2025-12-04T09:14:58.8709999Z Autotune Choices Stats: 2025-12-04T09:14:58.8711543Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_102", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.8713504Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.8714333Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.8715289Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.8717117Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8719500Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.8721949Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.8724317Z triton_flex_attention_backward_101 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8726691Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8729059Z triton_flex_attention_backward_107 0.0133 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.8731463Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8733914Z triton_flex_attention_backward_108 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.8736310Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8738704Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8740209Z SingleProcess AUTOTUNE benchmarking takes 0.6615 seconds and 2.2434 seconds precompiling for 13 choices 2025-12-04T09:14:58.8740659Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.8740929Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.8741121Z unimplemented [] 2025-12-04T09:14:58.8741325Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.8741684Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.8743127Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.8744426Z graph_break [] 2025-12-04T09:14:58.8744645Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.8744920Z Autotune Choices Stats: 2025-12-04T09:14:58.8746416Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.8748146Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.8748682Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.8749301Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.8750837Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8753179Z triton_flex_attention_119 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8755511Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.8757839Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.8760136Z triton_flex_attention_114 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8762693Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8764536Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2587 seconds precompiling for 6 choices 2025-12-04T09:14:58.8764922Z Autotune Choices Stats: 2025-12-04T09:14:58.8766468Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.8768407Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.8769251Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.8770214Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.8772079Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8774653Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8777082Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.8779461Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.8781835Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8784243Z triton_flex_attention_backward_124 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8786612Z triton_flex_attention_backward_126 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.8789022Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.8791420Z triton_flex_attention_backward_129 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8793833Z triton_flex_attention_backward_132 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.8795294Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.3330 seconds precompiling for 13 choices 2025-12-04T09:14:58.8795745Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.8796025Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.8796212Z unimplemented [] 2025-12-04T09:14:58.8796415Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.8796775Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.8798215Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.8799566Z graph_break [] 2025-12-04T09:14:58.8799783Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.8800061Z Autotune Choices Stats: 2025-12-04T09:14:58.8801566Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:58.8803250Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.8803781Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.8804389Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.8805956Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8811700Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8814033Z triton_flex_attention_136 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.8816327Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.8818585Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8820842Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8822239Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3488 seconds precompiling for 6 choices 2025-12-04T09:14:58.8822626Z Autotune Choices Stats: 2025-12-04T09:14:58.8824164Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_140", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.8826064Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.8826938Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.8827875Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.8829867Z triton_flex_attention_backward_140 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8832235Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.8834586Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8836930Z triton_flex_attention_backward_142 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.8839275Z triton_flex_attention_backward_144 0.0131 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8841612Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8843994Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.8846345Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.8848803Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8851159Z triton_flex_attention_backward_149 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.8852657Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.2392 seconds precompiling for 13 choices 2025-12-04T09:14:58.8853132Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.8853413Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.8853610Z unimplemented [] 2025-12-04T09:14:58.8853817Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.8854174Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.8855616Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.8856928Z graph_break [] 2025-12-04T09:14:58.8857152Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.8857424Z Autotune Choices Stats: 2025-12-04T09:14:58.8858930Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.8860620Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.8861159Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.8861993Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.8863556Z triton_flex_attention_156 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8866047Z triton_flex_attention_157 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8868329Z triton_flex_attention_152 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8870617Z triton_flex_attention_154 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.8872896Z triton_flex_attention_155 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.8875181Z triton_flex_attention_153 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8876608Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3309 seconds precompiling for 6 choices 2025-12-04T09:14:58.8876996Z Autotune Choices Stats: 2025-12-04T09:14:58.8878530Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_160", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.8880510Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.8881362Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.8882450Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.8884253Z triton_flex_attention_backward_160 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.8886621Z triton_flex_attention_backward_158 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8888987Z triton_flex_attention_backward_159 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8891348Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.8893749Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8896113Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8898504Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.8900987Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.8903338Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8905711Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.8907170Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.1546 seconds precompiling for 13 choices 2025-12-04T09:14:58.8907626Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.8907906Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.8908089Z unimplemented [] 2025-12-04T09:14:58.8908294Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.8908656Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.8910092Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.8911382Z graph_break [] 2025-12-04T09:14:58.8911603Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.8911890Z Autotune Choices Stats: 2025-12-04T09:14:58.8913393Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.8915113Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.8915641Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.8916311Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.8917847Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8920138Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8922405Z triton_flex_attention_171 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8924704Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.8926999Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.8929293Z triton_flex_attention_172 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8930716Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3199 seconds precompiling for 6 choices 2025-12-04T09:14:58.8931116Z Autotune Choices Stats: 2025-12-04T09:14:58.8932754Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_178", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.8934749Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.8935621Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.8936576Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.8938381Z triton_flex_attention_backward_178 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8940772Z triton_flex_attention_backward_177 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8943135Z triton_flex_attention_backward_180 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.8945497Z triton_flex_attention_backward_179 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.8947854Z triton_flex_attention_backward_182 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8950260Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8952739Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.8955095Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.8957455Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8959805Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.8961269Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3449 seconds precompiling for 13 choices 2025-12-04T09:14:58.8961733Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.8962196Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.8962397Z unimplemented [] 2025-12-04T09:14:58.8962613Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.8962987Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.8964425Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.8965726Z graph_break [] 2025-12-04T09:14:58.8965953Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.8966224Z Autotune Choices Stats: 2025-12-04T09:14:58.8967819Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:58.8969644Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.8970181Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.8970855Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.8972427Z triton_flex_attention_194 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8974751Z triton_flex_attention_195 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8977054Z triton_flex_attention_192 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.8979358Z triton_flex_attention_193 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.8981645Z triton_flex_attention_190 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8983936Z triton_flex_attention_191 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.8985395Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3094 seconds precompiling for 6 choices 2025-12-04T09:14:58.8985778Z Autotune Choices Stats: 2025-12-04T09:14:58.8987346Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.8989342Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.8990172Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.8991136Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.8992938Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.8995308Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.8997685Z triton_flex_attention_backward_197 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9000062Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9002467Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9004858Z triton_flex_attention_backward_200 0.0124 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9007299Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9009669Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9012036Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.9014459Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9015924Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.3930 seconds precompiling for 13 choices 2025-12-04T09:14:58.9016378Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.9016652Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.9016843Z unimplemented [] 2025-12-04T09:14:58.9017044Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.9017402Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.9018858Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.9020166Z graph_break [] 2025-12-04T09:14:58.9020436Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.9020715Z Autotune Choices Stats: 2025-12-04T09:14:58.9022248Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008063999935984612, "best_triton_pos": 0} 2025-12-04T09:14:58.9024040Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9024576Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9025187Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9026711Z triton_flex_attention_213 0.0081 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9029019Z triton_flex_attention_214 0.0082 ms 98.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9031314Z triton_flex_attention_211 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.9033605Z triton_flex_attention_212 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.9035903Z triton_flex_attention_209 0.0112 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9038225Z triton_flex_attention_210 0.0133 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9039728Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.3148 seconds precompiling for 6 choices 2025-12-04T09:14:58.9040110Z Autotune Choices Stats: 2025-12-04T09:14:58.9041687Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.9043607Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9044438Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9045398Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9047205Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9049592Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9051955Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9054395Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9056763Z triton_flex_attention_backward_219 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9059224Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9061588Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9064148Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9066521Z triton_flex_attention_backward_224 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9068889Z triton_flex_attention_backward_227 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.9070366Z SingleProcess AUTOTUNE benchmarking takes 0.6619 seconds and 2.3746 seconds precompiling for 13 choices 2025-12-04T09:14:58.9070829Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.9071112Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.9071302Z unimplemented [] 2025-12-04T09:14:58.9071508Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.9071874Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.9073385Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.9074805Z graph_break [] 2025-12-04T09:14:58.9075026Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.9075304Z Autotune Choices Stats: 2025-12-04T09:14:58.9076859Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.9078555Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9079104Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9079713Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9081261Z triton_flex_attention_232 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9083573Z triton_flex_attention_233 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9085871Z triton_flex_attention_228 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9088168Z triton_flex_attention_230 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.9090513Z triton_flex_attention_231 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.9092858Z triton_flex_attention_229 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9094430Z SingleProcess AUTOTUNE benchmarking takes 0.2900 seconds and 1.2906 seconds precompiling for 6 choices 2025-12-04T09:14:58.9094818Z Autotune Choices Stats: 2025-12-04T09:14:58.9096358Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.9098279Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9099116Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9100067Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9101870Z triton_flex_attention_backward_234 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9104246Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9106632Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9109043Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9111556Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9113932Z triton_flex_attention_backward_240 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9116303Z triton_flex_attention_backward_241 0.0134 ms 91.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9118661Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9121027Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9123391Z triton_flex_attention_backward_246 0.0153 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.9124858Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2768 seconds precompiling for 13 choices 2025-12-04T09:14:58.9125316Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.9125601Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.9125791Z unimplemented [] 2025-12-04T09:14:58.9126030Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.9126390Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.9127828Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.9129227Z graph_break [] 2025-12-04T09:14:58.9129449Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.9129720Z Autotune Choices Stats: 2025-12-04T09:14:58.9131235Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.9132965Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9133502Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9134110Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9135627Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9137938Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9140234Z triton_flex_attention_250 0.0111 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.9142571Z triton_flex_attention_247 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9144867Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.9147275Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9153149Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2952 seconds precompiling for 6 choices 2025-12-04T09:14:58.9153552Z Autotune Choices Stats: 2025-12-04T09:14:58.9155135Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_255", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.9157075Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9157912Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9158889Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9160699Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9163331Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9165822Z triton_flex_attention_backward_254 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9168404Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9170772Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9173235Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9175599Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9177955Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9180318Z triton_flex_attention_backward_262 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9182671Z triton_flex_attention_backward_265 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.9184168Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.3139 seconds precompiling for 13 choices 2025-12-04T09:14:58.9184630Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.9184991Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.9185184Z unimplemented [] 2025-12-04T09:14:58.9185390Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.9185750Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.9187226Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.9188524Z graph_break [] 2025-12-04T09:14:58.9188751Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.9189038Z Autotune Choices Stats: 2025-12-04T09:14:58.9190545Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.9192239Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9192775Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9193392Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9194911Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9197204Z triton_flex_attention_271 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9199504Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9201831Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.9204200Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.9206495Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9207918Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T09:14:58.9208298Z Autotune Choices Stats: 2025-12-04T09:14:58.9209857Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.9211778Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9212735Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9213694Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9215487Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9217902Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9220278Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9222736Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9225116Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9227474Z triton_flex_attention_backward_278 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9229842Z triton_flex_attention_backward_279 0.0136 ms 90.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9232209Z triton_flex_attention_backward_276 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9234577Z triton_flex_attention_backward_281 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9236986Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.9238509Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.2940 seconds precompiling for 13 choices 2025-12-04T09:14:58.9238967Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.9239249Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.9239470Z unimplemented [] 2025-12-04T09:14:58.9239678Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.9240044Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.9241497Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.9242794Z graph_break [] 2025-12-04T09:14:58.9243022Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.9243300Z Autotune Choices Stats: 2025-12-04T09:14:58.9244814Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.9246510Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9247054Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9247663Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9249183Z triton_flex_attention_289 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9251475Z triton_flex_attention_290 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9253847Z triton_flex_attention_287 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.9256238Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9258527Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.9260826Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9262434Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3003 seconds precompiling for 6 choices 2025-12-04T09:14:58.9262837Z Autotune Choices Stats: 2025-12-04T09:14:58.9264391Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_294", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.012160000391304493, "best_triton_pos": 0} 2025-12-04T09:14:58.9266314Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9267144Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9268097Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9269906Z triton_flex_attention_backward_294 0.0122 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9272335Z triton_flex_attention_backward_291 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9274832Z triton_flex_attention_backward_292 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9277220Z triton_flex_attention_backward_293 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9279583Z triton_flex_attention_backward_296 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9281949Z triton_flex_attention_backward_295 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9284313Z triton_flex_attention_backward_297 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9286675Z triton_flex_attention_backward_298 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9289078Z triton_flex_attention_backward_300 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9291473Z triton_flex_attention_backward_303 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.9293072Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.2523 seconds precompiling for 13 choices 2025-12-04T09:14:58.9293523Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.9293809Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.9294006Z unimplemented [] 2025-12-04T09:14:58.9294202Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.9294569Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.9296015Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.9297313Z graph_break [] 2025-12-04T09:14:58.9297538Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.9297810Z Autotune Choices Stats: 2025-12-04T09:14:58.9299313Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_308", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.9301015Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9301552Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9302160Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9303669Z triton_flex_attention_308 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9306004Z triton_flex_attention_309 0.0082 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9308296Z triton_flex_attention_307 0.0112 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.9310681Z triton_flex_attention_304 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9312967Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.9315260Z triton_flex_attention_305 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9316681Z SingleProcess AUTOTUNE benchmarking takes 0.2924 seconds and 1.4096 seconds precompiling for 6 choices 2025-12-04T09:14:58.9317067Z Autotune Choices Stats: 2025-12-04T09:14:58.9318597Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.9320518Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9321350Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9322304Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9324152Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9326631Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9328999Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9331364Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9333801Z triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9336167Z triton_flex_attention_backward_316 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9338528Z triton_flex_attention_backward_317 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9340935Z triton_flex_attention_backward_314 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9342092Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9343334Z triton_flex_attention_backward_318 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9343585Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.3225 seconds precompiling for 13 choices 2025-12-04T09:14:58.9343718Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.9343789Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.9343861Z unimplemented [] 2025-12-04T09:14:58.9343963Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.9344149Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.9345325Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.9345386Z graph_break [] 2025-12-04T09:14:58.9345522Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.9345588Z Autotune Choices Stats: 2025-12-04T09:14:58.9346980Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.007968000136315823, "best_triton_pos": 0} 2025-12-04T09:14:58.9347223Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9347443Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9347761Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9348918Z triton_flex_attention_327 0.0080 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9350030Z triton_flex_attention_328 0.0082 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9351253Z triton_flex_attention_325 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.9352359Z triton_flex_attention_326 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.9353595Z triton_flex_attention_323 0.0113 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9354704Z triton_flex_attention_324 0.0133 ms 59.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9354954Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T09:14:58.9355018Z Autotune Choices Stats: 2025-12-04T09:14:58.9356444Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.9356879Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9357212Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9357809Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9359013Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9360222Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9361387Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9362739Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9363923Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9365088Z triton_flex_attention_backward_336 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9366242Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9367463Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9368773Z triton_flex_attention_backward_338 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9369928Z triton_flex_attention_backward_341 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.9370187Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3668 seconds precompiling for 13 choices 2025-12-04T09:14:58.9370320Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.9370391Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.9370460Z unimplemented [] 2025-12-04T09:14:58.9370564Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.9370750Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.9371952Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.9372015Z graph_break [] 2025-12-04T09:14:58.9372150Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.9372216Z Autotune Choices Stats: 2025-12-04T09:14:58.9373662Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.9373912Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9374135Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9374449Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9375622Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9376835Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9377957Z triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9379074Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.9380190Z triton_flex_attention_345 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.9381305Z triton_flex_attention_343 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9381557Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2820 seconds precompiling for 6 choices 2025-12-04T09:14:58.9381624Z Autotune Choices Stats: 2025-12-04T09:14:58.9383052Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_349", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.9383489Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9383849Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9384483Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9385683Z triton_flex_attention_backward_349 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9386834Z triton_flex_attention_backward_348 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9387990Z triton_flex_attention_backward_350 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9389135Z triton_flex_attention_backward_351 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9390291Z triton_flex_attention_backward_353 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9391446Z triton_flex_attention_backward_355 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9392631Z triton_flex_attention_backward_352 0.0143 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9393902Z triton_flex_attention_backward_354 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9395056Z triton_flex_attention_backward_357 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9396210Z triton_flex_attention_backward_360 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.9396463Z SingleProcess AUTOTUNE benchmarking takes 0.6637 seconds and 2.2959 seconds precompiling for 13 choices 2025-12-04T09:14:58.9396592Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.9396660Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.9396729Z unimplemented [] 2025-12-04T09:14:58.9396835Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.9397019Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.9398201Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.9398259Z graph_break [] 2025-12-04T09:14:58.9398391Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.9398458Z Autotune Choices Stats: 2025-12-04T09:14:58.9399844Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.9400086Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9400337Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9400659Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9401879Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9403000Z triton_flex_attention_366 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9404116Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.9405224Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9406343Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.9407463Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9407714Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3270 seconds precompiling for 6 choices 2025-12-04T09:14:58.9407780Z Autotune Choices Stats: 2025-12-04T09:14:58.9409251Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_368", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.9409683Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9410075Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9410664Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9411833Z triton_flex_attention_backward_368 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9413019Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9414172Z triton_flex_attention_backward_369 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9415322Z triton_flex_attention_backward_370 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9416480Z triton_flex_attention_backward_372 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9417693Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9418835Z triton_flex_attention_backward_374 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9420080Z triton_flex_attention_backward_371 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9421233Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9422388Z triton_flex_attention_backward_379 0.0146 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.9422636Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.4637 seconds precompiling for 13 choices 2025-12-04T09:14:58.9422764Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.9422831Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.9422897Z unimplemented [] 2025-12-04T09:14:58.9422997Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.9423180Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.9424361Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.9424424Z graph_break [] 2025-12-04T09:14:58.9424560Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.9424624Z Autotune Choices Stats: 2025-12-04T09:14:58.9426061Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_384", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.9426309Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9426586Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9426899Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9428060Z triton_flex_attention_384 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9429183Z triton_flex_attention_385 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9430302Z triton_flex_attention_380 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9431409Z triton_flex_attention_382 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.9432535Z triton_flex_attention_383 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.9433671Z triton_flex_attention_381 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9433920Z SingleProcess AUTOTUNE benchmarking takes 0.2918 seconds and 1.3242 seconds precompiling for 6 choices 2025-12-04T09:14:58.9433984Z Autotune Choices Stats: 2025-12-04T09:14:58.9435440Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_386", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.9435963Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9436294Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9436850Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9438019Z triton_flex_attention_backward_386 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9439171Z triton_flex_attention_backward_387 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9440325Z triton_flex_attention_backward_388 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9441477Z triton_flex_attention_backward_389 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9442638Z triton_flex_attention_backward_391 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9443837Z triton_flex_attention_backward_392 0.0134 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9445094Z triton_flex_attention_backward_390 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9446247Z triton_flex_attention_backward_393 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9447392Z triton_flex_attention_backward_395 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9448542Z triton_flex_attention_backward_394 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9448786Z SingleProcess AUTOTUNE benchmarking takes 0.6645 seconds and 2.2524 seconds precompiling for 13 choices 2025-12-04T09:14:58.9448920Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.9448987Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.9449064Z unimplemented [] 2025-12-04T09:14:58.9449167Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.9449349Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.9450522Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.9450583Z graph_break [] 2025-12-04T09:14:58.9450712Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.9450776Z Autotune Choices Stats: 2025-12-04T09:14:58.9452202Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_403", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.9452541Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9452805Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9453123Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9454248Z triton_flex_attention_403 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9455368Z triton_flex_attention_404 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9456481Z triton_flex_attention_399 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9457593Z triton_flex_attention_401 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.9458704Z triton_flex_attention_402 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.9459848Z triton_flex_attention_400 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9460094Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3374 seconds precompiling for 6 choices 2025-12-04T09:14:58.9460219Z Autotune Choices Stats: 2025-12-04T09:14:58.9461676Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_406", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.9462272Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9462609Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9463174Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9464356Z triton_flex_attention_backward_406 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9465516Z triton_flex_attention_backward_405 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9466681Z triton_flex_attention_backward_407 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9467834Z triton_flex_attention_backward_408 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9469048Z triton_flex_attention_backward_410 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9470329Z triton_flex_attention_backward_409 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9471477Z triton_flex_attention_backward_411 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9472629Z triton_flex_attention_backward_412 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9473771Z triton_flex_attention_backward_413 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9474926Z triton_flex_attention_backward_414 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9475169Z SingleProcess AUTOTUNE benchmarking takes 0.6644 seconds and 2.4031 seconds precompiling for 13 choices 2025-12-04T09:14:58.9475305Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.9475375Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.9475440Z unimplemented [] 2025-12-04T09:14:58.9475539Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.9475721Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.9476930Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.9476993Z graph_break [] 2025-12-04T09:14:58.9477124Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.9477278Z Autotune Choices Stats: 2025-12-04T09:14:58.9478697Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_422", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.9478957Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9479172Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9479491Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9480619Z triton_flex_attention_422 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9481729Z triton_flex_attention_423 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9482841Z triton_flex_attention_420 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.9483957Z triton_flex_attention_421 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.9485072Z triton_flex_attention_418 0.0123 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9486216Z triton_flex_attention_419 0.0142 ms 57.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9486520Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3073 seconds precompiling for 6 choices 2025-12-04T09:14:58.9486588Z Autotune Choices Stats: 2025-12-04T09:14:58.9488049Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_426", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012256000190973282, "best_triton_pos": 0} 2025-12-04T09:14:58.9488478Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9488820Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9489377Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9490547Z triton_flex_attention_backward_426 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9491697Z triton_flex_attention_backward_424 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9492913Z triton_flex_attention_backward_425 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9494098Z triton_flex_attention_backward_427 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9495277Z triton_flex_attention_backward_429 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9496491Z triton_flex_attention_backward_431 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9497638Z triton_flex_attention_backward_430 0.0141 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9498788Z triton_flex_attention_backward_428 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9499930Z triton_flex_attention_backward_433 0.0143 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9501081Z triton_flex_attention_backward_436 0.0144 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.9501324Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.3392 seconds precompiling for 13 choices 2025-12-04T09:14:58.9501461Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.9501532Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.9501597Z unimplemented [] 2025-12-04T09:14:58.9501708Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.9501892Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.9503096Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.9503230Z graph_break [] 2025-12-04T09:14:58.9503361Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.9503426Z Autotune Choices Stats: 2025-12-04T09:14:58.9505194Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_441", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.9505454Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9505670Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9506007Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9507143Z triton_flex_attention_441 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9508265Z triton_flex_attention_442 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9509374Z triton_flex_attention_439 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.9510495Z triton_flex_attention_437 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9511643Z triton_flex_attention_440 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.9512861Z triton_flex_attention_438 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9513110Z SingleProcess AUTOTUNE benchmarking takes 0.2906 seconds and 1.4377 seconds precompiling for 6 choices 2025-12-04T09:14:58.9513176Z Autotune Choices Stats: 2025-12-04T09:14:58.9514605Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_445", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.9515036Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9515366Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9515922Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9517095Z triton_flex_attention_backward_445 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9518243Z triton_flex_attention_backward_446 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9519427Z triton_flex_attention_backward_443 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9520573Z triton_flex_attention_backward_444 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9521813Z triton_flex_attention_backward_447 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9522961Z triton_flex_attention_backward_448 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9524105Z triton_flex_attention_backward_450 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9525260Z triton_flex_attention_backward_449 0.0135 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9526402Z triton_flex_attention_backward_452 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9527553Z triton_flex_attention_backward_455 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.9527794Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.3278 seconds precompiling for 13 choices 2025-12-04T09:14:58.9528009Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.9528083Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.9528144Z unimplemented [] 2025-12-04T09:14:58.9528251Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.9528433Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.9529700Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.9529761Z graph_break [] 2025-12-04T09:14:58.9529891Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.9529960Z Autotune Choices Stats: 2025-12-04T09:14:58.9531338Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_460", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:58.9531583Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9531799Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9532120Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9533302Z triton_flex_attention_460 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9534426Z triton_flex_attention_461 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9535542Z triton_flex_attention_459 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.9536692Z triton_flex_attention_456 0.0114 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9537842Z triton_flex_attention_458 0.0123 ms 75.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.9539021Z triton_flex_attention_457 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9539269Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.4513 seconds precompiling for 6 choices 2025-12-04T09:14:58.9539334Z Autotune Choices Stats: 2025-12-04T09:14:58.9540775Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_463", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.9541203Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9541536Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9542101Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9543270Z triton_flex_attention_backward_463 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9544422Z triton_flex_attention_backward_465 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9545628Z triton_flex_attention_backward_462 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9546868Z triton_flex_attention_backward_464 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9548017Z triton_flex_attention_backward_467 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9549172Z triton_flex_attention_backward_468 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9550324Z triton_flex_attention_backward_469 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9551473Z triton_flex_attention_backward_466 0.0133 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9552616Z triton_flex_attention_backward_471 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9553800Z triton_flex_attention_backward_470 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9554043Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.2662 seconds precompiling for 13 choices 2025-12-04T09:14:58.9554239Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.9554309Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.9554368Z unimplemented [] 2025-12-04T09:14:58.9554475Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.9554687Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.9555861Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.9555922Z graph_break [] 2025-12-04T09:14:58.9556055Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.9556126Z Autotune Choices Stats: 2025-12-04T09:14:58.9557505Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_479", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.007968000136315823, "best_triton_pos": 0} 2025-12-04T09:14:58.9557748Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9557962Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9558283Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9559406Z triton_flex_attention_479 0.0080 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9560517Z triton_flex_attention_480 0.0082 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9561660Z triton_flex_attention_477 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.9562952Z triton_flex_attention_478 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.9564243Z triton_flex_attention_475 0.0113 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9565360Z triton_flex_attention_476 0.0133 ms 59.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9565611Z SingleProcess AUTOTUNE benchmarking takes 0.2917 seconds and 1.3266 seconds precompiling for 6 choices 2025-12-04T09:14:58.9565675Z Autotune Choices Stats: 2025-12-04T09:14:58.9567092Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_481", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.9567526Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9567855Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9568418Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9569577Z triton_flex_attention_backward_481 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9570773Z triton_flex_attention_backward_482 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9572046Z triton_flex_attention_backward_483 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9573252Z triton_flex_attention_backward_484 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9574406Z triton_flex_attention_backward_485 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9575548Z triton_flex_attention_backward_486 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9576698Z triton_flex_attention_backward_487 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9577847Z triton_flex_attention_backward_488 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9579028Z triton_flex_attention_backward_490 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9580182Z triton_flex_attention_backward_493 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.9580488Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2475 seconds precompiling for 13 choices 2025-12-04T09:14:58.9580654Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.9580723Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.9580783Z unimplemented [] 2025-12-04T09:14:58.9580887Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.9581068Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.9582240Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.9582301Z graph_break [] 2025-12-04T09:14:58.9582428Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.9582500Z Autotune Choices Stats: 2025-12-04T09:14:58.9583888Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_498", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.9584134Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9584360Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9584680Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9585805Z triton_flex_attention_498 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9586918Z triton_flex_attention_499 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9588059Z triton_flex_attention_494 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9589267Z triton_flex_attention_496 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.9590383Z triton_flex_attention_497 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.9591495Z triton_flex_attention_495 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9591742Z SingleProcess AUTOTUNE benchmarking takes 0.2916 seconds and 1.3152 seconds precompiling for 6 choices 2025-12-04T09:14:58.9591807Z Autotune Choices Stats: 2025-12-04T09:14:58.9593228Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_500", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.9593662Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9593984Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9594549Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9595742Z triton_flex_attention_backward_500 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9596910Z triton_flex_attention_backward_501 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9598153Z triton_flex_attention_backward_502 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9599312Z triton_flex_attention_backward_503 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9600470Z triton_flex_attention_backward_505 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9601614Z triton_flex_attention_backward_507 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9602766Z triton_flex_attention_backward_506 0.0143 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9603913Z triton_flex_attention_backward_504 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9605087Z triton_flex_attention_backward_509 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9606342Z triton_flex_attention_backward_512 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.9606585Z SingleProcess AUTOTUNE benchmarking takes 0.6641 seconds and 2.2733 seconds precompiling for 13 choices 2025-12-04T09:14:58.9606719Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.9606790Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.9606855Z unimplemented [] 2025-12-04T09:14:58.9606963Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.9607146Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.9608322Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 27), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 8), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 5), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.9608382Z graph_break [] 2025-12-04T09:14:58.9608507Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.9608580Z Autotune Choices Stats: 2025-12-04T09:14:58.9609970Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_517", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.9610217Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9610433Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9610761Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9611887Z triton_flex_attention_517 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9613096Z triton_flex_attention_518 0.0084 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9614304Z triton_flex_attention_513 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9615425Z triton_flex_attention_515 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.9616540Z triton_flex_attention_516 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.9617655Z triton_flex_attention_514 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9617906Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3311 seconds precompiling for 6 choices 2025-12-04T09:14:58.9617970Z Autotune Choices Stats: 2025-12-04T09:14:58.9619389Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_519", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.9619818Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9620142Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9620703Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9621893Z triton_flex_attention_backward_519 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9623141Z triton_flex_attention_backward_520 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9624291Z triton_flex_attention_backward_521 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9625440Z triton_flex_attention_backward_522 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9626594Z triton_flex_attention_backward_524 0.0132 ms 93.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9627756Z triton_flex_attention_backward_523 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9628906Z triton_flex_attention_backward_525 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9630089Z triton_flex_attention_backward_526 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9631325Z triton_flex_attention_backward_528 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9632478Z triton_flex_attention_backward_531 0.0153 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.9632719Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.2909 seconds precompiling for 13 choices 2025-12-04T09:14:58.9632848Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.9632916Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.9632975Z unimplemented [] 2025-12-04T09:14:58.9633081Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.9633263Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.9634440Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.9634501Z graph_break [] 2025-12-04T09:14:58.9634628Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.9634698Z Autotune Choices Stats: 2025-12-04T09:14:58.9636083Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_536", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.9636338Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9636559Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9636874Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9638037Z triton_flex_attention_536 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9639162Z triton_flex_attention_537 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9640427Z triton_flex_attention_532 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9641545Z triton_flex_attention_534 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.9642658Z triton_flex_attention_535 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.9643781Z triton_flex_attention_533 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9644031Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.2928 seconds precompiling for 6 choices 2025-12-04T09:14:58.9644093Z Autotune Choices Stats: 2025-12-04T09:14:58.9645513Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_538", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:14:58.9645947Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9646299Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9646857Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9648110Z triton_flex_attention_backward_538 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9649289Z triton_flex_attention_backward_539 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9650445Z triton_flex_attention_backward_540 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9651591Z triton_flex_attention_backward_541 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9652784Z triton_flex_attention_backward_542 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9653930Z triton_flex_attention_backward_543 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9655119Z triton_flex_attention_backward_545 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9656268Z triton_flex_attention_backward_544 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9657556Z triton_flex_attention_backward_547 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9658706Z triton_flex_attention_backward_550 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.9658952Z SingleProcess AUTOTUNE benchmarking takes 0.6681 seconds and 2.3456 seconds precompiling for 13 choices 2025-12-04T09:14:58.9659084Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.9659151Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.9659214Z unimplemented [] 2025-12-04T09:14:58.9659322Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.9659503Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.9660682Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.9660741Z graph_break [] 2025-12-04T09:14:58.9660862Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.9660936Z Autotune Choices Stats: 2025-12-04T09:14:58.9662492Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_555", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:58.9662748Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9662962Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9663348Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9664492Z triton_flex_attention_555 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9665741Z triton_flex_attention_556 0.0112 ms 91.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9666854Z triton_flex_attention_553 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.9667969Z triton_flex_attention_554 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.9669075Z triton_flex_attention_551 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9670192Z triton_flex_attention_552 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9670439Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.3230 seconds precompiling for 6 choices 2025-12-04T09:14:58.9670508Z Autotune Choices Stats: 2025-12-04T09:14:58.9671930Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_559", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:58.9672427Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9672752Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9673407Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9674575Z triton_flex_attention_backward_559 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9675734Z triton_flex_attention_backward_560 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9676885Z triton_flex_attention_backward_558 0.0142 ms 93.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9678030Z triton_flex_attention_backward_557 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9679180Z triton_flex_attention_backward_562 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9680327Z triton_flex_attention_backward_561 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9681509Z triton_flex_attention_backward_563 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9682749Z triton_flex_attention_backward_564 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9683913Z triton_flex_attention_backward_569 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.9685073Z triton_flex_attention_backward_566 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9685314Z SingleProcess AUTOTUNE benchmarking takes 0.6659 seconds and 2.2862 seconds precompiling for 13 choices 2025-12-04T09:14:58.9685444Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.9685514Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.9685574Z unimplemented [] 2025-12-04T09:14:58.9685679Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.9685859Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.9687036Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.9687101Z graph_break [] 2025-12-04T09:14:58.9687226Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.9687300Z Autotune Choices Stats: 2025-12-04T09:14:58.9688681Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_575", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:58.9688958Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9689176Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9689556Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9690718Z triton_flex_attention_575 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9691834Z triton_flex_attention_574 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9693016Z triton_flex_attention_572 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.9694134Z triton_flex_attention_573 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.9695239Z triton_flex_attention_570 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9696352Z triton_flex_attention_571 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9696593Z SingleProcess AUTOTUNE benchmarking takes 0.2930 seconds and 1.2848 seconds precompiling for 6 choices 2025-12-04T09:14:58.9696662Z Autotune Choices Stats: 2025-12-04T09:14:58.9698119Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_579", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.014271999709308147, "best_triton_pos": 0} 2025-12-04T09:14:58.9698617Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9698985Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9699546Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9700699Z triton_flex_attention_backward_579 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9701861Z triton_flex_attention_backward_577 0.0143 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9703007Z triton_flex_attention_backward_578 0.0143 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9704168Z triton_flex_attention_backward_576 0.0145 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9705321Z triton_flex_attention_backward_580 0.0154 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9706498Z triton_flex_attention_backward_581 0.0154 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9707740Z triton_flex_attention_backward_583 0.0155 ms 92.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9708887Z triton_flex_attention_backward_582 0.0164 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9710038Z triton_flex_attention_backward_585 0.0164 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9711189Z triton_flex_attention_backward_588 0.0164 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.9711445Z SingleProcess AUTOTUNE benchmarking takes 0.6683 seconds and 2.2637 seconds precompiling for 13 choices 2025-12-04T09:14:58.9711577Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.9711649Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.9711710Z unimplemented [] 2025-12-04T09:14:58.9711817Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.9711998Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.9713170Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.9713231Z graph_break [] 2025-12-04T09:14:58.9713357Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.9713432Z Autotune Choices Stats: 2025-12-04T09:14:58.9714851Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.9715156Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9715371Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9715721Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9716847Z triton_flex_attention_593 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9717973Z triton_flex_attention_594 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9719086Z triton_flex_attention_589 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9720206Z triton_flex_attention_591 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.9721319Z triton_flex_attention_592 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.9722437Z triton_flex_attention_590 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9722710Z SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.3283 seconds precompiling for 6 choices 2025-12-04T09:14:58.9722782Z Autotune Choices Stats: 2025-12-04T09:14:58.9724236Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_595", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:14:58.9724725Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9725048Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9725609Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9726772Z triton_flex_attention_backward_595 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9727923Z triton_flex_attention_backward_596 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9729086Z triton_flex_attention_backward_597 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9730240Z triton_flex_attention_backward_598 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9731434Z triton_flex_attention_backward_600 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9732618Z triton_flex_attention_backward_604 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9733888Z triton_flex_attention_backward_599 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9735037Z triton_flex_attention_backward_601 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9736186Z triton_flex_attention_backward_602 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9737337Z triton_flex_attention_backward_607 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.9737581Z SingleProcess AUTOTUNE benchmarking takes 0.6683 seconds and 2.3815 seconds precompiling for 13 choices 2025-12-04T09:14:58.9737715Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.9737783Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.9737843Z unimplemented [] 2025-12-04T09:14:58.9737956Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.9738142Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.9739323Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.9739389Z graph_break [] 2025-12-04T09:14:58.9739555Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.9739626Z Autotune Choices Stats: 2025-12-04T09:14:58.9741045Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_612", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.9741350Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9741565Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9741882Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9743002Z triton_flex_attention_612 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9744114Z triton_flex_attention_613 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9745226Z triton_flex_attention_610 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.9746340Z triton_flex_attention_608 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9747452Z triton_flex_attention_611 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.9748596Z triton_flex_attention_609 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9748898Z SingleProcess AUTOTUNE benchmarking takes 0.2948 seconds and 1.3336 seconds precompiling for 6 choices 2025-12-04T09:14:58.9748968Z Autotune Choices Stats: 2025-12-04T09:14:58.9750416Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_614", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:14:58.9750849Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9751186Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9751748Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9752906Z triton_flex_attention_backward_614 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9754060Z triton_flex_attention_backward_615 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9755215Z triton_flex_attention_backward_616 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9756371Z triton_flex_attention_backward_617 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9757557Z triton_flex_attention_backward_619 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9758796Z triton_flex_attention_backward_620 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9759947Z triton_flex_attention_backward_621 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9761091Z triton_flex_attention_backward_618 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9762435Z triton_flex_attention_backward_623 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9763616Z triton_flex_attention_backward_626 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.9763858Z SingleProcess AUTOTUNE benchmarking takes 0.6671 seconds and 2.3094 seconds precompiling for 13 choices 2025-12-04T09:14:58.9763997Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.9764064Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.9764124Z unimplemented [] 2025-12-04T09:14:58.9764228Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.9764411Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.9765648Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.9765822Z graph_break [] 2025-12-04T09:14:58.9765952Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.9766024Z Autotune Choices Stats: 2025-12-04T09:14:58.9767454Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_631", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:58.9767700Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9767917Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9768228Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9769366Z triton_flex_attention_631 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9770482Z triton_flex_attention_632 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9771594Z triton_flex_attention_627 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9772778Z triton_flex_attention_629 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.9773932Z triton_flex_attention_630 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.9775042Z triton_flex_attention_628 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9775385Z SingleProcess AUTOTUNE benchmarking takes 0.2939 seconds and 1.3035 seconds precompiling for 6 choices 2025-12-04T09:14:58.9775456Z Autotune Choices Stats: 2025-12-04T09:14:58.9776880Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_634", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:58.9777315Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9777638Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9778203Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9779387Z triton_flex_attention_backward_634 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9780546Z triton_flex_attention_backward_635 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9781699Z triton_flex_attention_backward_636 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9782891Z triton_flex_attention_backward_633 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9784154Z triton_flex_attention_backward_638 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9785307Z triton_flex_attention_backward_640 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9786475Z triton_flex_attention_backward_637 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9787620Z triton_flex_attention_backward_639 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9788778Z triton_flex_attention_backward_642 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9789936Z triton_flex_attention_backward_645 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.9790187Z SingleProcess AUTOTUNE benchmarking takes 0.6697 seconds and 2.2461 seconds precompiling for 13 choices 2025-12-04T09:14:58.9790358Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:14:58.9790440Z Traceback (most recent call last): 2025-12-04T09:14:58.9790773Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:14:58.9790845Z self.assertTrue( 2025-12-04T09:14:58.9791043Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:14:58.9791127Z raise self.failureException(msg) 2025-12-04T09:14:58.9791479Z AssertionError: False is not true : Log file /tmp/tmp5v2t29we/flex_attention_configs.json was not created 2025-12-04T09:14:58.9791483Z 2025-12-04T09:14:58.9791617Z To execute this test, run the following from the base repo dir: 2025-12-04T09:14:58.9791877Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:14:58.9791881Z 2025-12-04T09:14:58.9792078Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:14:58.9792211Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.9792285Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.9792346Z unimplemented [] 2025-12-04T09:14:58.9792451Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.9793641Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('async_compile_cache_miss', 23), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:14:58.9793826Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.9793903Z graph_break [] 2025-12-04T09:14:58.9794032Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.9795022Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:14:58.9795104Z current_size = base.storage().size() 2025-12-04T09:14:58.9795171Z Autotune Choices Stats: 2025-12-04T09:14:58.9796566Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:58.9796810Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9797034Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9797350Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9798483Z triton_flex_attention_4 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9799642Z triton_flex_attention_5 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9800847Z triton_flex_attention_0 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9801954Z triton_flex_attention_2 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.9803063Z triton_flex_attention_3 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.9804166Z triton_flex_attention_1 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9804408Z SingleProcess AUTOTUNE benchmarking takes 0.2600 seconds and 1.4244 seconds precompiling for 6 choices 2025-12-04T09:14:58.9804474Z Autotune Choices Stats: 2025-12-04T09:14:58.9805903Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:58.9806339Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9806662Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9807257Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9808434Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9809686Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9810829Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9811974Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9813177Z triton_flex_attention_backward_11 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9814326Z triton_flex_attention_backward_12 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9815475Z triton_flex_attention_backward_13 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9816653Z triton_flex_attention_backward_10 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9817886Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9819033Z triton_flex_attention_backward_15 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9819286Z SingleProcess AUTOTUNE benchmarking takes 0.8717 seconds and 2.2315 seconds precompiling for 13 choices 2025-12-04T09:14:58.9819414Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.9819489Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.9819548Z unimplemented [] 2025-12-04T09:14:58.9819651Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.9819840Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.9821007Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.9821073Z graph_break [] 2025-12-04T09:14:58.9821199Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.9821265Z Autotune Choices Stats: 2025-12-04T09:14:58.9822658Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.9822899Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9823121Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9823437Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9824610Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9825827Z triton_flex_attention_24 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9826948Z triton_flex_attention_19 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9828056Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.9829164Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.9830275Z triton_flex_attention_20 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9830516Z SingleProcess AUTOTUNE benchmarking takes 0.2898 seconds and 1.2778 seconds precompiling for 6 choices 2025-12-04T09:14:58.9830582Z Autotune Choices Stats: 2025-12-04T09:14:58.9832008Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.9832440Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9832803Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9833421Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9834612Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9835763Z triton_flex_attention_backward_27 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9836914Z triton_flex_attention_backward_28 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9838062Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9839219Z triton_flex_attention_backward_30 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9840370Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9841549Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9842805Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9843951Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9845102Z triton_flex_attention_backward_37 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.9845352Z SingleProcess AUTOTUNE benchmarking takes 0.6605 seconds and 2.3199 seconds precompiling for 13 choices 2025-12-04T09:14:58.9845479Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.9845550Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.9845609Z unimplemented [] 2025-12-04T09:14:58.9845724Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.9845926Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.9847110Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.9847174Z graph_break [] 2025-12-04T09:14:58.9847304Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.9847371Z Autotune Choices Stats: 2025-12-04T09:14:58.9848775Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.9849021Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9849282Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9849597Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9850727Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9851932Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9853117Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.9854233Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.9855347Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9856461Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9856707Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2616 seconds precompiling for 6 choices 2025-12-04T09:14:58.9856784Z Autotune Choices Stats: 2025-12-04T09:14:58.9858253Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.9858686Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9859093Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9859679Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9860863Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9862174Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9863370Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9864524Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9865665Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9866874Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9868022Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9869316Z triton_flex_attention_backward_50 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9870467Z triton_flex_attention_backward_53 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9871616Z triton_flex_attention_backward_56 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.9871869Z SingleProcess AUTOTUNE benchmarking takes 0.6631 seconds and 2.2635 seconds precompiling for 13 choices 2025-12-04T09:14:58.9872003Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.9872071Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.9872138Z unimplemented [] 2025-12-04T09:14:58.9872242Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.9872428Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.9873605Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.9873671Z graph_break [] 2025-12-04T09:14:58.9873797Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.9873862Z Autotune Choices Stats: 2025-12-04T09:14:58.9875297Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.9875543Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9875764Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9876141Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9877315Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9878427Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9879556Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.9880666Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.9881779Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9882888Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9883132Z SingleProcess AUTOTUNE benchmarking takes 0.2907 seconds and 1.2500 seconds precompiling for 6 choices 2025-12-04T09:14:58.9883196Z Autotune Choices Stats: 2025-12-04T09:14:58.9884655Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.9885142Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9885502Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9886056Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9887218Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9888365Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9889518Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9890670Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9891819Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9893054Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9894304Z triton_flex_attention_backward_69 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9895457Z triton_flex_attention_backward_70 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9896607Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9897748Z triton_flex_attention_backward_75 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.9897998Z SingleProcess AUTOTUNE benchmarking takes 0.6601 seconds and 2.1469 seconds precompiling for 13 choices 2025-12-04T09:14:58.9898125Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.9898195Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.9898260Z unimplemented [] 2025-12-04T09:14:58.9898361Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.9898554Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.9899726Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.9899788Z graph_break [] 2025-12-04T09:14:58.9899924Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.9899989Z Autotune Choices Stats: 2025-12-04T09:14:58.9901412Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.9901723Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9901977Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9902290Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9903422Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9904539Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9905656Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.9906763Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.9907880Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9909006Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9909277Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2576 seconds precompiling for 6 choices 2025-12-04T09:14:58.9909342Z Autotune Choices Stats: 2025-12-04T09:14:58.9910800Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_84", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.9911302Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9911632Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9912193Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9913357Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9914505Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9915668Z triton_flex_attention_backward_83 0.0122 ms 92.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9916819Z triton_flex_attention_backward_82 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9917996Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9919242Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9920390Z triton_flex_attention_backward_88 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9921543Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9922688Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9923835Z triton_flex_attention_backward_91 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9924090Z SingleProcess AUTOTUNE benchmarking takes 0.6599 seconds and 2.2799 seconds precompiling for 13 choices 2025-12-04T09:14:58.9924222Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.9924300Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.9924370Z unimplemented [] 2025-12-04T09:14:58.9924477Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.9924667Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.9925841Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.9925906Z graph_break [] 2025-12-04T09:14:58.9926084Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.9926161Z Autotune Choices Stats: 2025-12-04T09:14:58.9927629Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:58.9927933Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9928156Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9928472Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9929608Z triton_flex_attention_99 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9930748Z triton_flex_attention_100 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9931882Z triton_flex_attention_95 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9933035Z triton_flex_attention_98 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.9934148Z triton_flex_attention_97 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.9935296Z triton_flex_attention_96 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9935599Z SingleProcess AUTOTUNE benchmarking takes 0.2905 seconds and 1.2994 seconds precompiling for 6 choices 2025-12-04T09:14:58.9935663Z Autotune Choices Stats: 2025-12-04T09:14:58.9937127Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_102", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.9937567Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9937904Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9938464Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9939641Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9940800Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9941950Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9943143Z triton_flex_attention_backward_101 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9944291Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9945704Z triton_flex_attention_backward_107 0.0133 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9946851Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9948010Z triton_flex_attention_backward_108 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9949167Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9950311Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9950561Z SingleProcess AUTOTUNE benchmarking takes 0.6615 seconds and 2.2434 seconds precompiling for 13 choices 2025-12-04T09:14:58.9950689Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.9950758Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.9950825Z unimplemented [] 2025-12-04T09:14:58.9950928Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.9951117Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.9952321Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.9952443Z graph_break [] 2025-12-04T09:14:58.9952574Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.9952641Z Autotune Choices Stats: 2025-12-04T09:14:58.9954082Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:58.9954321Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9954546Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9954875Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9956011Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9957119Z triton_flex_attention_119 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9958239Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.9959353Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.9960503Z triton_flex_attention_114 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9961707Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9962094Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2587 seconds precompiling for 6 choices 2025-12-04T09:14:58.9962163Z Autotune Choices Stats: 2025-12-04T09:14:58.9963614Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:58.9964052Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9964383Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9964938Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9966111Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9967271Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9968432Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9969655Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9970956Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9972114Z triton_flex_attention_backward_124 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9973298Z triton_flex_attention_backward_126 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9974446Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9975607Z triton_flex_attention_backward_129 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9976750Z triton_flex_attention_backward_132 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:58.9977006Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.3330 seconds precompiling for 13 choices 2025-12-04T09:14:58.9977137Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:58.9977207Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:58.9977275Z unimplemented [] 2025-12-04T09:14:58.9977412Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:58.9977604Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:58.9978813Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:58.9978941Z graph_break [] 2025-12-04T09:14:58.9979075Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:58.9979142Z Autotune Choices Stats: 2025-12-04T09:14:58.9980527Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:58.9980766Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9980989Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9981302Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9982426Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9983540Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9984653Z triton_flex_attention_136 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:58.9985802Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:58.9986921Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9988141Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9988394Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3488 seconds precompiling for 6 choices 2025-12-04T09:14:58.9988461Z Autotune Choices Stats: 2025-12-04T09:14:58.9989892Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_140", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:58.9990321Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:58.9990649Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:58.9991209Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:58.9992374Z triton_flex_attention_backward_140 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9993535Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:58.9994718Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9995964Z triton_flex_attention_backward_142 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:58.9997112Z triton_flex_attention_backward_144 0.0131 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:58.9998272Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:58.9999423Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0000594Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0001746Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0002929Z triton_flex_attention_backward_149 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0003178Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.2392 seconds precompiling for 13 choices 2025-12-04T09:14:59.0003304Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.0003436Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.0003502Z unimplemented [] 2025-12-04T09:14:59.0003600Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.0003780Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.0004987Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.0005047Z graph_break [] 2025-12-04T09:14:59.0005178Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.0005247Z Autotune Choices Stats: 2025-12-04T09:14:59.0006631Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.0006873Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0007091Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0007407Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0008537Z triton_flex_attention_156 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0009645Z triton_flex_attention_157 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0010763Z triton_flex_attention_152 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0011905Z triton_flex_attention_154 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.0013158Z triton_flex_attention_155 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.0014276Z triton_flex_attention_153 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0014525Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3309 seconds precompiling for 6 choices 2025-12-04T09:14:59.0014591Z Autotune Choices Stats: 2025-12-04T09:14:59.0016012Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_160", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.0016443Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0016770Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0017326Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0018501Z triton_flex_attention_backward_160 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0019686Z triton_flex_attention_backward_158 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0020848Z triton_flex_attention_backward_159 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0022113Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0023260Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0024411Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0025559Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0026710Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0027869Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0029047Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.0029353Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.1546 seconds precompiling for 13 choices 2025-12-04T09:14:59.0029481Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.0029583Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.0029650Z unimplemented [] 2025-12-04T09:14:59.0029750Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.0029932Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.0031108Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.0031170Z graph_break [] 2025-12-04T09:14:59.0031304Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.0031368Z Autotune Choices Stats: 2025-12-04T09:14:59.0032768Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.0033011Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0033232Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0033546Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0034669Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0035785Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0036953Z triton_flex_attention_171 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0038164Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.0039289Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.0040406Z triton_flex_attention_172 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0040658Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3199 seconds precompiling for 6 choices 2025-12-04T09:14:59.0040727Z Autotune Choices Stats: 2025-12-04T09:14:59.0042151Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_178", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.0042581Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0042909Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0043464Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0044630Z triton_flex_attention_backward_178 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0045805Z triton_flex_attention_backward_177 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0047064Z triton_flex_attention_backward_180 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0048222Z triton_flex_attention_backward_179 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0049375Z triton_flex_attention_backward_182 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0050523Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0051668Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0052877Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0054074Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0055315Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.0055561Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3449 seconds precompiling for 13 choices 2025-12-04T09:14:59.0055688Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.0055755Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.0055824Z unimplemented [] 2025-12-04T09:14:59.0055923Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.0056106Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.0057280Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.0057338Z graph_break [] 2025-12-04T09:14:59.0057466Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.0057533Z Autotune Choices Stats: 2025-12-04T09:14:59.0058916Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:59.0059155Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0059373Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0059686Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0060808Z triton_flex_attention_194 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0062193Z triton_flex_attention_195 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0063376Z triton_flex_attention_192 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.0064579Z triton_flex_attention_193 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.0065696Z triton_flex_attention_190 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0066803Z triton_flex_attention_191 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0067053Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3094 seconds precompiling for 6 choices 2025-12-04T09:14:59.0067118Z Autotune Choices Stats: 2025-12-04T09:14:59.0068554Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.0068985Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0069316Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0069872Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0071071Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0072314Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0073487Z triton_flex_attention_backward_197 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0074641Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0075793Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0076952Z triton_flex_attention_backward_200 0.0124 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0078101Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0079290Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0080436Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.0081700Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0081954Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.3930 seconds precompiling for 13 choices 2025-12-04T09:14:59.0082085Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.0082153Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.0082219Z unimplemented [] 2025-12-04T09:14:59.0082322Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.0082507Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.0083692Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.0083752Z graph_break [] 2025-12-04T09:14:59.0083896Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.0083963Z Autotune Choices Stats: 2025-12-04T09:14:59.0085355Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008063999935984612, "best_triton_pos": 0} 2025-12-04T09:14:59.0085599Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0085815Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0086135Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0087301Z triton_flex_attention_213 0.0081 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0088414Z triton_flex_attention_214 0.0082 ms 98.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0089630Z triton_flex_attention_211 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.0090740Z triton_flex_attention_212 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.0091855Z triton_flex_attention_209 0.0112 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0093030Z triton_flex_attention_210 0.0133 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0093284Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.3148 seconds precompiling for 6 choices 2025-12-04T09:14:59.0093350Z Autotune Choices Stats: 2025-12-04T09:14:59.0094774Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.0095207Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0095536Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0096137Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0097403Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0098561Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0099730Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0100880Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0102037Z triton_flex_attention_backward_219 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0103193Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0104341Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0105526Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0106758Z triton_flex_attention_backward_224 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0107904Z triton_flex_attention_backward_227 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.0108163Z SingleProcess AUTOTUNE benchmarking takes 0.6619 seconds and 2.3746 seconds precompiling for 13 choices 2025-12-04T09:14:59.0108294Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.0108361Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.0108424Z unimplemented [] 2025-12-04T09:14:59.0108523Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.0108705Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.0109877Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.0109938Z graph_break [] 2025-12-04T09:14:59.0110072Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.0110136Z Autotune Choices Stats: 2025-12-04T09:14:59.0111517Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.0111757Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0111976Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0112293Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0113464Z triton_flex_attention_232 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0114702Z triton_flex_attention_233 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0115821Z triton_flex_attention_228 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0116937Z triton_flex_attention_230 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.0118052Z triton_flex_attention_231 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.0119171Z triton_flex_attention_229 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0119416Z SingleProcess AUTOTUNE benchmarking takes 0.2900 seconds and 1.2906 seconds precompiling for 6 choices 2025-12-04T09:14:59.0119481Z Autotune Choices Stats: 2025-12-04T09:14:59.0120909Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:59.0121373Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0121700Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0122311Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0123525Z triton_flex_attention_backward_234 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0124678Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0125843Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0126992Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0128148Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0129305Z triton_flex_attention_backward_240 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0130480Z triton_flex_attention_backward_241 0.0134 ms 91.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0131720Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0132911Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0134065Z triton_flex_attention_backward_246 0.0153 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.0134309Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2768 seconds precompiling for 13 choices 2025-12-04T09:14:59.0134442Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.0134510Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.0134579Z unimplemented [] 2025-12-04T09:14:59.0134680Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.0134861Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.0136035Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.0136096Z graph_break [] 2025-12-04T09:14:59.0136227Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.0136292Z Autotune Choices Stats: 2025-12-04T09:14:59.0137679Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.0137920Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0138173Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0138491Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0139720Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0140832Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0141952Z triton_flex_attention_250 0.0111 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.0143061Z triton_flex_attention_247 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0144183Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.0145287Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0145535Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2952 seconds precompiling for 6 choices 2025-12-04T09:14:59.0145599Z Autotune Choices Stats: 2025-12-04T09:14:59.0147053Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_255", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.0147560Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0147917Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0148473Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0149645Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0150796Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0151947Z triton_flex_attention_backward_254 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0153103Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0158943Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0160271Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0161497Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0163282Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0164460Z triton_flex_attention_backward_262 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0165622Z triton_flex_attention_backward_265 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.0165884Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.3139 seconds precompiling for 13 choices 2025-12-04T09:14:59.0166038Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.0166112Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.0166184Z unimplemented [] 2025-12-04T09:14:59.0166296Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.0166489Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.0167689Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.0167758Z graph_break [] 2025-12-04T09:14:59.0167903Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.0167972Z Autotune Choices Stats: 2025-12-04T09:14:59.0169474Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.0169834Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0170058Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0170389Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0171568Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0172761Z triton_flex_attention_271 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0173881Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0174999Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.0176112Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.0177225Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0177475Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T09:14:59.0177544Z Autotune Choices Stats: 2025-12-04T09:14:59.0179025Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:59.0179547Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0179888Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0180456Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0181626Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0182780Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0183945Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0185095Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0186299Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0187449Z triton_flex_attention_backward_278 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0188729Z triton_flex_attention_backward_279 0.0136 ms 90.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0189884Z triton_flex_attention_backward_276 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0191029Z triton_flex_attention_backward_281 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0192174Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.0192424Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.2940 seconds precompiling for 13 choices 2025-12-04T09:14:59.0192564Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.0192634Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.0192704Z unimplemented [] 2025-12-04T09:14:59.0192811Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.0192998Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.0194182Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.0194242Z graph_break [] 2025-12-04T09:14:59.0194374Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.0194442Z Autotune Choices Stats: 2025-12-04T09:14:59.0195868Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.0196254Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0196476Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0196796Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0197920Z triton_flex_attention_289 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0199039Z triton_flex_attention_290 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0200149Z triton_flex_attention_287 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.0201270Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0202382Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.0203525Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0203775Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3003 seconds precompiling for 6 choices 2025-12-04T09:14:59.0203905Z Autotune Choices Stats: 2025-12-04T09:14:59.0205362Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_294", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.012160000391304493, "best_triton_pos": 0} 2025-12-04T09:14:59.0205799Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0206130Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0206694Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0207861Z triton_flex_attention_backward_294 0.0122 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0209019Z triton_flex_attention_backward_291 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0210170Z triton_flex_attention_backward_292 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0211314Z triton_flex_attention_backward_293 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0212558Z triton_flex_attention_backward_296 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0213810Z triton_flex_attention_backward_295 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0214956Z triton_flex_attention_backward_297 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0216110Z triton_flex_attention_backward_298 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0217254Z triton_flex_attention_backward_300 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0218413Z triton_flex_attention_backward_303 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.0218661Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.2523 seconds precompiling for 13 choices 2025-12-04T09:14:59.0218801Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.0218874Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.0218937Z unimplemented [] 2025-12-04T09:14:59.0219047Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.0219233Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.0220448Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.0220515Z graph_break [] 2025-12-04T09:14:59.0220654Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.0220796Z Autotune Choices Stats: 2025-12-04T09:14:59.0222209Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_308", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.0222459Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0222675Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0223001Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0224131Z triton_flex_attention_308 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0225241Z triton_flex_attention_309 0.0082 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0226350Z triton_flex_attention_307 0.0112 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.0227471Z triton_flex_attention_304 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0228631Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.0229745Z triton_flex_attention_305 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0230052Z SingleProcess AUTOTUNE benchmarking takes 0.2924 seconds and 1.4096 seconds precompiling for 6 choices 2025-12-04T09:14:59.0230153Z Autotune Choices Stats: 2025-12-04T09:14:59.0231581Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.0232012Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0232341Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0232902Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0234059Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0235218Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0236371Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0237551Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0238815Z triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0239957Z triton_flex_attention_backward_316 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0241107Z triton_flex_attention_backward_317 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0242260Z triton_flex_attention_backward_314 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0243403Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0244561Z triton_flex_attention_backward_318 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0244807Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.3225 seconds precompiling for 13 choices 2025-12-04T09:14:59.0244946Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.0245017Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.0245081Z unimplemented [] 2025-12-04T09:14:59.0245188Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.0245372Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.0246583Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.0246722Z graph_break [] 2025-12-04T09:14:59.0246849Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.0246954Z Autotune Choices Stats: 2025-12-04T09:14:59.0248336Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.007968000136315823, "best_triton_pos": 0} 2025-12-04T09:14:59.0248585Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0248804Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0249127Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0250254Z triton_flex_attention_327 0.0080 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0251372Z triton_flex_attention_328 0.0082 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0252570Z triton_flex_attention_325 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.0253693Z triton_flex_attention_326 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.0254847Z triton_flex_attention_323 0.0113 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0256052Z triton_flex_attention_324 0.0133 ms 59.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0256300Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T09:14:59.0256366Z Autotune Choices Stats: 2025-12-04T09:14:59.0257793Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.0258229Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0258555Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0259124Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0260286Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0261442Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0262848Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0264026Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0265337Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0266486Z triton_flex_attention_backward_336 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0267639Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0268791Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0269935Z triton_flex_attention_backward_338 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0271098Z triton_flex_attention_backward_341 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.0271347Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3668 seconds precompiling for 13 choices 2025-12-04T09:14:59.0271521Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.0271592Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.0271655Z unimplemented [] 2025-12-04T09:14:59.0271767Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.0272019Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.0273243Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.0273306Z graph_break [] 2025-12-04T09:14:59.0273433Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.0273506Z Autotune Choices Stats: 2025-12-04T09:14:59.0274897Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.0275151Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0275371Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0275696Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0276834Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0277952Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0279063Z triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0280234Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.0281463Z triton_flex_attention_345 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.0282572Z triton_flex_attention_343 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0282824Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2820 seconds precompiling for 6 choices 2025-12-04T09:14:59.0282891Z Autotune Choices Stats: 2025-12-04T09:14:59.0284314Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_349", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.0284752Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0285080Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0285653Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0286817Z triton_flex_attention_backward_349 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0287969Z triton_flex_attention_backward_348 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0289152Z triton_flex_attention_backward_350 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0290401Z triton_flex_attention_backward_351 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0291549Z triton_flex_attention_backward_353 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0292760Z triton_flex_attention_backward_355 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0293922Z triton_flex_attention_backward_352 0.0143 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0295078Z triton_flex_attention_backward_354 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0296229Z triton_flex_attention_backward_357 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0297425Z triton_flex_attention_backward_360 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.0297735Z SingleProcess AUTOTUNE benchmarking takes 0.6637 seconds and 2.2959 seconds precompiling for 13 choices 2025-12-04T09:14:59.0297873Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.0297942Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.0298005Z unimplemented [] 2025-12-04T09:14:59.0298113Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.0298330Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.0299534Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.0299599Z graph_break [] 2025-12-04T09:14:59.0299728Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.0299800Z Autotune Choices Stats: 2025-12-04T09:14:59.0301193Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.0301446Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0301664Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0301984Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0303113Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0304227Z triton_flex_attention_366 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0305377Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.0306504Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0307709Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.0308836Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0309092Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3270 seconds precompiling for 6 choices 2025-12-04T09:14:59.0309159Z Autotune Choices Stats: 2025-12-04T09:14:59.0310584Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_368", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.0311027Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0311353Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0311918Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0313078Z triton_flex_attention_backward_368 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0314283Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0315532Z triton_flex_attention_backward_369 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0316686Z triton_flex_attention_backward_370 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0317841Z triton_flex_attention_backward_372 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0319003Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0320159Z triton_flex_attention_backward_374 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0321311Z triton_flex_attention_backward_371 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0322489Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0323644Z triton_flex_attention_backward_379 0.0146 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.0323994Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.4637 seconds precompiling for 13 choices 2025-12-04T09:14:59.0324133Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.0324202Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.0324267Z unimplemented [] 2025-12-04T09:14:59.0324375Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.0324557Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.0325732Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.0325798Z graph_break [] 2025-12-04T09:14:59.0325927Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.0326003Z Autotune Choices Stats: 2025-12-04T09:14:59.0327388Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_384", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.0327638Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0327854Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0328176Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0329298Z triton_flex_attention_384 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0330429Z triton_flex_attention_385 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0331581Z triton_flex_attention_380 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0332891Z triton_flex_attention_382 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.0334002Z triton_flex_attention_383 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.0335123Z triton_flex_attention_381 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0335371Z SingleProcess AUTOTUNE benchmarking takes 0.2918 seconds and 1.3242 seconds precompiling for 6 choices 2025-12-04T09:14:59.0335438Z Autotune Choices Stats: 2025-12-04T09:14:59.0336863Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_386", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:59.0337303Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0337627Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0338198Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0339394Z triton_flex_attention_backward_386 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0340578Z triton_flex_attention_backward_387 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0341809Z triton_flex_attention_backward_388 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0342960Z triton_flex_attention_backward_389 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0344118Z triton_flex_attention_backward_391 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0345279Z triton_flex_attention_backward_392 0.0134 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0346433Z triton_flex_attention_backward_390 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0347590Z triton_flex_attention_backward_393 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0348765Z triton_flex_attention_backward_395 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0350003Z triton_flex_attention_backward_394 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0350249Z SingleProcess AUTOTUNE benchmarking takes 0.6645 seconds and 2.2524 seconds precompiling for 13 choices 2025-12-04T09:14:59.0350385Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.0350459Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.0350521Z unimplemented [] 2025-12-04T09:14:59.0350631Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.0350812Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.0351992Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.0352056Z graph_break [] 2025-12-04T09:14:59.0352184Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.0352258Z Autotune Choices Stats: 2025-12-04T09:14:59.0353642Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_403", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.0353902Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0354125Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0354447Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0355570Z triton_flex_attention_403 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0356724Z triton_flex_attention_404 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0357926Z triton_flex_attention_399 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0359050Z triton_flex_attention_401 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.0360160Z triton_flex_attention_402 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.0361277Z triton_flex_attention_400 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0361519Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3374 seconds precompiling for 6 choices 2025-12-04T09:14:59.0361592Z Autotune Choices Stats: 2025-12-04T09:14:59.0363203Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_406", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.0363648Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0363983Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0364619Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0365794Z triton_flex_attention_backward_406 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0367094Z triton_flex_attention_backward_405 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0368250Z triton_flex_attention_backward_407 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0369408Z triton_flex_attention_backward_408 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0370558Z triton_flex_attention_backward_410 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0371712Z triton_flex_attention_backward_409 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0372910Z triton_flex_attention_backward_411 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0374092Z triton_flex_attention_backward_412 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0375351Z triton_flex_attention_backward_413 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0376498Z triton_flex_attention_backward_414 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0376743Z SingleProcess AUTOTUNE benchmarking takes 0.6644 seconds and 2.4031 seconds precompiling for 13 choices 2025-12-04T09:14:59.0376877Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.0376944Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.0377008Z unimplemented [] 2025-12-04T09:14:59.0377114Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.0377296Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.0378473Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.0378538Z graph_break [] 2025-12-04T09:14:59.0378665Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.0378736Z Autotune Choices Stats: 2025-12-04T09:14:59.0380118Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_422", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.0380369Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0380584Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0380906Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0382064Z triton_flex_attention_422 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0383271Z triton_flex_attention_423 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0384383Z triton_flex_attention_420 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.0385502Z triton_flex_attention_421 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.0386612Z triton_flex_attention_418 0.0123 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0387726Z triton_flex_attention_419 0.0142 ms 57.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0387967Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3073 seconds precompiling for 6 choices 2025-12-04T09:14:59.0388053Z Autotune Choices Stats: 2025-12-04T09:14:59.0389482Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_426", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012256000190973282, "best_triton_pos": 0} 2025-12-04T09:14:59.0389922Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0390280Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0390859Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0392277Z triton_flex_attention_backward_426 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0393445Z triton_flex_attention_backward_424 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0394603Z triton_flex_attention_backward_425 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0395742Z triton_flex_attention_backward_427 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0396893Z triton_flex_attention_backward_429 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0398038Z triton_flex_attention_backward_431 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0399235Z triton_flex_attention_backward_430 0.0141 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0400383Z triton_flex_attention_backward_428 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0401624Z triton_flex_attention_backward_433 0.0143 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0402777Z triton_flex_attention_backward_436 0.0144 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.0403023Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.3392 seconds precompiling for 13 choices 2025-12-04T09:14:59.0403162Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.0403233Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.0403296Z unimplemented [] 2025-12-04T09:14:59.0403405Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.0403589Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.0404763Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.0404829Z graph_break [] 2025-12-04T09:14:59.0404955Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.0405030Z Autotune Choices Stats: 2025-12-04T09:14:59.0406419Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_441", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.0406672Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0406889Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0407246Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0408372Z triton_flex_attention_441 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0409602Z triton_flex_attention_442 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0410711Z triton_flex_attention_439 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.0411842Z triton_flex_attention_437 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0412997Z triton_flex_attention_440 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.0414119Z triton_flex_attention_438 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0414363Z SingleProcess AUTOTUNE benchmarking takes 0.2906 seconds and 1.4377 seconds precompiling for 6 choices 2025-12-04T09:14:59.0414441Z Autotune Choices Stats: 2025-12-04T09:14:59.0415899Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_445", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.0416337Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0416723Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0417326Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0418482Z triton_flex_attention_backward_445 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0419643Z triton_flex_attention_backward_446 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0420788Z triton_flex_attention_backward_443 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0421943Z triton_flex_attention_backward_444 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0423093Z triton_flex_attention_backward_447 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0424244Z triton_flex_attention_backward_448 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0425424Z triton_flex_attention_backward_450 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0426660Z triton_flex_attention_backward_449 0.0135 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0427814Z triton_flex_attention_backward_452 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0428969Z triton_flex_attention_backward_455 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.0429220Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.3278 seconds precompiling for 13 choices 2025-12-04T09:14:59.0429357Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.0429437Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.0429501Z unimplemented [] 2025-12-04T09:14:59.0429605Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.0429788Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.0430956Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.0431022Z graph_break [] 2025-12-04T09:14:59.0431149Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.0431221Z Autotune Choices Stats: 2025-12-04T09:14:59.0432642Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_460", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:59.0432892Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0433108Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0433482Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0434645Z triton_flex_attention_460 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0435765Z triton_flex_attention_461 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0436881Z triton_flex_attention_459 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.0437990Z triton_flex_attention_456 0.0114 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0439113Z triton_flex_attention_458 0.0123 ms 75.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.0440222Z triton_flex_attention_457 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0440465Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.4513 seconds precompiling for 6 choices 2025-12-04T09:14:59.0440538Z Autotune Choices Stats: 2025-12-04T09:14:59.0442008Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_463", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.0442521Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0442876Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0443442Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0444602Z triton_flex_attention_backward_463 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0445764Z triton_flex_attention_backward_465 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0446912Z triton_flex_attention_backward_462 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0448064Z triton_flex_attention_backward_464 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0449217Z triton_flex_attention_backward_467 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0450397Z triton_flex_attention_backward_468 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0451643Z triton_flex_attention_backward_469 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0452852Z triton_flex_attention_backward_466 0.0133 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0454010Z triton_flex_attention_backward_471 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0455164Z triton_flex_attention_backward_470 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0455409Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.2662 seconds precompiling for 13 choices 2025-12-04T09:14:59.0455539Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.0455613Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.0455676Z unimplemented [] 2025-12-04T09:14:59.0455780Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.0455967Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.0457140Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.0457207Z graph_break [] 2025-12-04T09:14:59.0457335Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.0457406Z Autotune Choices Stats: 2025-12-04T09:14:59.0458822Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_479", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.007968000136315823, "best_triton_pos": 0} 2025-12-04T09:14:59.0459129Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0459377Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0459692Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0460823Z triton_flex_attention_479 0.0080 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0462055Z triton_flex_attention_480 0.0082 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0463180Z triton_flex_attention_477 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.0464306Z triton_flex_attention_478 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.0465415Z triton_flex_attention_475 0.0113 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0466533Z triton_flex_attention_476 0.0133 ms 59.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0466834Z SingleProcess AUTOTUNE benchmarking takes 0.2917 seconds and 1.3266 seconds precompiling for 6 choices 2025-12-04T09:14:59.0466910Z Autotune Choices Stats: 2025-12-04T09:14:59.0468432Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_481", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.0468977Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0469302Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0470260Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0471442Z triton_flex_attention_backward_481 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0472601Z triton_flex_attention_backward_482 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0473752Z triton_flex_attention_backward_483 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0474906Z triton_flex_attention_backward_484 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0476113Z triton_flex_attention_backward_485 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0477287Z triton_flex_attention_backward_486 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0478500Z triton_flex_attention_backward_487 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0479655Z triton_flex_attention_backward_488 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0480810Z triton_flex_attention_backward_490 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0481962Z triton_flex_attention_backward_493 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.0482208Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2475 seconds precompiling for 13 choices 2025-12-04T09:14:59.0482344Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.0482421Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.0482485Z unimplemented [] 2025-12-04T09:14:59.0482597Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.0482779Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.0483949Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.0484017Z graph_break [] 2025-12-04T09:14:59.0484180Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.0484254Z Autotune Choices Stats: 2025-12-04T09:14:59.0485675Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_498", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.0485987Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0486206Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0486517Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0487647Z triton_flex_attention_498 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0488758Z triton_flex_attention_499 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0489866Z triton_flex_attention_494 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0490981Z triton_flex_attention_496 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.0492092Z triton_flex_attention_497 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.0493326Z triton_flex_attention_495 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0493633Z SingleProcess AUTOTUNE benchmarking takes 0.2916 seconds and 1.3152 seconds precompiling for 6 choices 2025-12-04T09:14:59.0493705Z Autotune Choices Stats: 2025-12-04T09:14:59.0495154Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_500", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:59.0495590Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0495918Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0496485Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0497815Z triton_flex_attention_backward_500 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0498998Z triton_flex_attention_backward_501 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0500149Z triton_flex_attention_backward_502 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0501347Z triton_flex_attention_backward_503 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0502499Z triton_flex_attention_backward_505 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0503756Z triton_flex_attention_backward_507 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0504905Z triton_flex_attention_backward_506 0.0143 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0506048Z triton_flex_attention_backward_504 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0507204Z triton_flex_attention_backward_509 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0508357Z triton_flex_attention_backward_512 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.0508604Z SingleProcess AUTOTUNE benchmarking takes 0.6641 seconds and 2.2733 seconds precompiling for 13 choices 2025-12-04T09:14:59.0508733Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.0508809Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.0508871Z unimplemented [] 2025-12-04T09:14:59.0508973Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.0509162Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.0510367Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 27), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 8), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 5), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.0510508Z graph_break [] 2025-12-04T09:14:59.0510639Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.0510710Z Autotune Choices Stats: 2025-12-04T09:14:59.0512129Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_517", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.0512379Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0512601Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0512913Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0514050Z triton_flex_attention_517 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0515156Z triton_flex_attention_518 0.0084 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0516270Z triton_flex_attention_513 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0517384Z triton_flex_attention_515 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.0518529Z triton_flex_attention_516 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.0519687Z triton_flex_attention_514 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0519991Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3311 seconds precompiling for 6 choices 2025-12-04T09:14:59.0520061Z Autotune Choices Stats: 2025-12-04T09:14:59.0521490Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_519", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:59.0521941Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0522270Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0522836Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0524000Z triton_flex_attention_backward_519 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0525159Z triton_flex_attention_backward_520 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0526314Z triton_flex_attention_backward_521 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0527506Z triton_flex_attention_backward_522 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0528763Z triton_flex_attention_backward_524 0.0132 ms 93.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0529921Z triton_flex_attention_backward_523 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0531076Z triton_flex_attention_backward_525 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0532222Z triton_flex_attention_backward_526 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0533434Z triton_flex_attention_backward_528 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0534584Z triton_flex_attention_backward_531 0.0153 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.0534833Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.2909 seconds precompiling for 13 choices 2025-12-04T09:14:59.0534962Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.0535039Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.0535101Z unimplemented [] 2025-12-04T09:14:59.0535242Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.0535438Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.0536612Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.0536799Z graph_break [] 2025-12-04T09:14:59.0536927Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.0537003Z Autotune Choices Stats: 2025-12-04T09:14:59.0538391Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_536", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.0538641Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0538858Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0539172Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0540308Z triton_flex_attention_536 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0541425Z triton_flex_attention_537 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0542539Z triton_flex_attention_532 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0543693Z triton_flex_attention_534 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.0544803Z triton_flex_attention_535 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.0546014Z triton_flex_attention_533 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0546255Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.2928 seconds precompiling for 6 choices 2025-12-04T09:14:59.0546341Z Autotune Choices Stats: 2025-12-04T09:14:59.0547765Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_538", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:14:59.0548204Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0548528Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0549093Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0550256Z triton_flex_attention_backward_538 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0551413Z triton_flex_attention_backward_539 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0552599Z triton_flex_attention_backward_540 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0553861Z triton_flex_attention_backward_541 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0555015Z triton_flex_attention_backward_542 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0556163Z triton_flex_attention_backward_543 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0557312Z triton_flex_attention_backward_545 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0558467Z triton_flex_attention_backward_544 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0559617Z triton_flex_attention_backward_547 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0560796Z triton_flex_attention_backward_550 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.0561043Z SingleProcess AUTOTUNE benchmarking takes 0.6681 seconds and 2.3456 seconds precompiling for 13 choices 2025-12-04T09:14:59.0561172Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.0561311Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.0561373Z unimplemented [] 2025-12-04T09:14:59.0561474Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.0561672Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.0563089Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.0563166Z graph_break [] 2025-12-04T09:14:59.0563298Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.0563376Z Autotune Choices Stats: 2025-12-04T09:14:59.0564777Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_555", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:59.0565026Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0565245Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0565560Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0566695Z triton_flex_attention_555 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0567807Z triton_flex_attention_556 0.0112 ms 91.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0568931Z triton_flex_attention_553 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.0570124Z triton_flex_attention_554 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.0571363Z triton_flex_attention_551 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0572544Z triton_flex_attention_552 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0572807Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.3230 seconds precompiling for 6 choices 2025-12-04T09:14:59.0572881Z Autotune Choices Stats: 2025-12-04T09:14:59.0574317Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_559", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:59.0574756Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0575084Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0575650Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0576815Z triton_flex_attention_backward_559 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0578016Z triton_flex_attention_backward_560 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0579165Z triton_flex_attention_backward_558 0.0142 ms 93.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0580411Z triton_flex_attention_backward_557 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0581567Z triton_flex_attention_backward_562 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0582727Z triton_flex_attention_backward_561 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0583887Z triton_flex_attention_backward_563 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0585036Z triton_flex_attention_backward_564 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0586196Z triton_flex_attention_backward_569 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.0587379Z triton_flex_attention_backward_566 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0587690Z SingleProcess AUTOTUNE benchmarking takes 0.6659 seconds and 2.2862 seconds precompiling for 13 choices 2025-12-04T09:14:59.0587821Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.0587898Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.0587993Z unimplemented [] 2025-12-04T09:14:59.0588100Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.0588286Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.0589462Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.0589533Z graph_break [] 2025-12-04T09:14:59.0589659Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.0589726Z Autotune Choices Stats: 2025-12-04T09:14:59.0591125Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_575", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:59.0591365Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0591587Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0591904Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0593032Z triton_flex_attention_575 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0594149Z triton_flex_attention_574 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0595322Z triton_flex_attention_572 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.0596550Z triton_flex_attention_573 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.0597666Z triton_flex_attention_570 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0598794Z triton_flex_attention_571 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0599034Z SingleProcess AUTOTUNE benchmarking takes 0.2930 seconds and 1.2848 seconds precompiling for 6 choices 2025-12-04T09:14:59.0599116Z Autotune Choices Stats: 2025-12-04T09:14:59.0600537Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_579", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.014271999709308147, "best_triton_pos": 0} 2025-12-04T09:14:59.0600976Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0601302Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0601865Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0603032Z triton_flex_attention_backward_579 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0604267Z triton_flex_attention_backward_577 0.0143 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0605511Z triton_flex_attention_backward_578 0.0143 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0606663Z triton_flex_attention_backward_576 0.0145 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0607814Z triton_flex_attention_backward_580 0.0154 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0608974Z triton_flex_attention_backward_581 0.0154 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0610147Z triton_flex_attention_backward_583 0.0155 ms 92.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0611298Z triton_flex_attention_backward_582 0.0164 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0612532Z triton_flex_attention_backward_585 0.0164 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0613773Z triton_flex_attention_backward_588 0.0164 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.0614020Z SingleProcess AUTOTUNE benchmarking takes 0.6683 seconds and 2.2637 seconds precompiling for 13 choices 2025-12-04T09:14:59.0614148Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.0614224Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.0614286Z unimplemented [] 2025-12-04T09:14:59.0614386Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.0614573Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.0615745Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.0615811Z graph_break [] 2025-12-04T09:14:59.0615935Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.0616003Z Autotune Choices Stats: 2025-12-04T09:14:59.0617391Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.0617636Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0617860Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0618172Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0619301Z triton_flex_attention_593 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0620458Z triton_flex_attention_594 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0621573Z triton_flex_attention_589 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0622800Z triton_flex_attention_591 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.0623914Z triton_flex_attention_592 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.0625038Z triton_flex_attention_590 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0625284Z SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.3283 seconds precompiling for 6 choices 2025-12-04T09:14:59.0625356Z Autotune Choices Stats: 2025-12-04T09:14:59.0626777Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_595", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:14:59.0627213Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0627544Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0628104Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0629304Z triton_flex_attention_backward_595 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0630577Z triton_flex_attention_backward_596 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0631742Z triton_flex_attention_backward_597 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0632905Z triton_flex_attention_backward_598 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0634052Z triton_flex_attention_backward_600 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0635210Z triton_flex_attention_backward_604 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0636363Z triton_flex_attention_backward_599 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0637544Z triton_flex_attention_backward_601 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0638698Z triton_flex_attention_backward_602 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0639945Z triton_flex_attention_backward_607 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.0640198Z SingleProcess AUTOTUNE benchmarking takes 0.6683 seconds and 2.3815 seconds precompiling for 13 choices 2025-12-04T09:14:59.0640334Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.0640409Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.0640473Z unimplemented [] 2025-12-04T09:14:59.0640578Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.0640790Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.0641975Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.0642043Z graph_break [] 2025-12-04T09:14:59.0642173Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.0642241Z Autotune Choices Stats: 2025-12-04T09:14:59.0643627Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_612", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.0643871Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0644093Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0644411Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0645581Z triton_flex_attention_612 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0646693Z triton_flex_attention_613 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0647913Z triton_flex_attention_610 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.0649044Z triton_flex_attention_608 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0650174Z triton_flex_attention_611 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.0651295Z triton_flex_attention_609 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0651542Z SingleProcess AUTOTUNE benchmarking takes 0.2948 seconds and 1.3336 seconds precompiling for 6 choices 2025-12-04T09:14:59.0651615Z Autotune Choices Stats: 2025-12-04T09:14:59.0653079Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_614", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:14:59.0653522Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0653847Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0654440Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0655646Z triton_flex_attention_backward_614 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0656877Z triton_flex_attention_backward_615 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0658042Z triton_flex_attention_backward_616 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0659207Z triton_flex_attention_backward_617 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0660369Z triton_flex_attention_backward_619 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0661526Z triton_flex_attention_backward_620 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0662802Z triton_flex_attention_backward_621 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0664017Z triton_flex_attention_backward_618 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0665330Z triton_flex_attention_backward_623 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0666485Z triton_flex_attention_backward_626 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.0666737Z SingleProcess AUTOTUNE benchmarking takes 0.6671 seconds and 2.3094 seconds precompiling for 13 choices 2025-12-04T09:14:59.0666871Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.0666947Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.0667010Z unimplemented [] 2025-12-04T09:14:59.0667125Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.0667319Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.0668505Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.0668577Z graph_break [] 2025-12-04T09:14:59.0668707Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.0668773Z Autotune Choices Stats: 2025-12-04T09:14:59.0670167Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_631", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:59.0670412Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0670636Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0670953Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0672120Z triton_flex_attention_631 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0673325Z triton_flex_attention_632 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0674454Z triton_flex_attention_627 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0675574Z triton_flex_attention_629 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.0676695Z triton_flex_attention_630 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.0677819Z triton_flex_attention_628 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0678062Z SingleProcess AUTOTUNE benchmarking takes 0.2939 seconds and 1.3035 seconds precompiling for 6 choices 2025-12-04T09:14:59.0678129Z Autotune Choices Stats: 2025-12-04T09:14:59.0679559Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_634", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:59.0680030Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0680353Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0680972Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0682173Z triton_flex_attention_backward_634 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0683337Z triton_flex_attention_backward_635 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0684503Z triton_flex_attention_backward_636 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0685657Z triton_flex_attention_backward_633 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0686810Z triton_flex_attention_backward_638 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0687966Z triton_flex_attention_backward_640 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0689151Z triton_flex_attention_backward_637 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0690398Z triton_flex_attention_backward_639 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0691556Z triton_flex_attention_backward_642 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0692743Z triton_flex_attention_backward_645 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.0693011Z SingleProcess AUTOTUNE benchmarking takes 0.6697 seconds and 2.2461 seconds precompiling for 13 choices 2025-12-04T09:14:59.0693140Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.0693213Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.0693279Z unimplemented [] 2025-12-04T09:14:59.0693380Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.0693567Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.0694745Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.0694813Z graph_break [] 2025-12-04T09:14:59.0694940Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.0695006Z Autotune Choices Stats: 2025-12-04T09:14:59.0696394Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_650", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:59.0696638Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0696919Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0697232Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0698454Z triton_flex_attention_650 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0699563Z triton_flex_attention_651 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0700692Z triton_flex_attention_648 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.0701807Z triton_flex_attention_646 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0702934Z triton_flex_attention_649 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.0704064Z triton_flex_attention_647 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0704315Z SingleProcess AUTOTUNE benchmarking takes 0.2938 seconds and 1.3188 seconds precompiling for 6 choices 2025-12-04T09:14:59.0704381Z Autotune Choices Stats: 2025-12-04T09:14:59.0705847Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_654", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:59.0706281Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0706697Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0707253Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0708426Z triton_flex_attention_backward_654 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0709582Z triton_flex_attention_backward_652 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0710738Z triton_flex_attention_backward_653 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0711901Z triton_flex_attention_backward_655 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0713050Z triton_flex_attention_backward_657 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0714238Z triton_flex_attention_backward_659 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0715398Z triton_flex_attention_backward_658 0.0156 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0716640Z triton_flex_attention_backward_656 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0717798Z triton_flex_attention_backward_661 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0718945Z triton_flex_attention_backward_664 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.0719206Z SingleProcess AUTOTUNE benchmarking takes 0.6665 seconds and 2.3325 seconds precompiling for 13 choices 2025-12-04T09:14:59.0719378Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:14:59.0719461Z Traceback (most recent call last): 2025-12-04T09:14:59.0719762Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:14:59.0719828Z self.assertTrue( 2025-12-04T09:14:59.0720035Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:14:59.0720116Z raise self.failureException(msg) 2025-12-04T09:14:59.0720364Z AssertionError: False is not true : Log file /tmp/tmpg4bk3jz1/flex_attention_configs.json was not created 2025-12-04T09:14:59.0720369Z 2025-12-04T09:14:59.0720522Z To execute this test, run the following from the base repo dir: 2025-12-04T09:14:59.0720777Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:14:59.0720781Z 2025-12-04T09:14:59.0720952Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:14:59.0721083Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.0721152Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.0721219Z unimplemented [] 2025-12-04T09:14:59.0721333Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.0722561Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('async_compile_cache_miss', 23), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:14:59.0722804Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.0722864Z graph_break [] 2025-12-04T09:14:59.0722997Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.0724039Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:14:59.0724129Z current_size = base.storage().size() 2025-12-04T09:14:59.0724196Z Autotune Choices Stats: 2025-12-04T09:14:59.0725599Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:59.0725847Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0726067Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0726404Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0727537Z triton_flex_attention_4 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0728655Z triton_flex_attention_5 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0729779Z triton_flex_attention_0 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0730928Z triton_flex_attention_2 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.0732212Z triton_flex_attention_3 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.0733388Z triton_flex_attention_1 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0733642Z SingleProcess AUTOTUNE benchmarking takes 0.2600 seconds and 1.4244 seconds precompiling for 6 choices 2025-12-04T09:14:59.0733710Z Autotune Choices Stats: 2025-12-04T09:14:59.0735140Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:59.0735572Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0735905Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0736466Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0737642Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0738791Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0739984Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0741215Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0742370Z triton_flex_attention_backward_11 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0743522Z triton_flex_attention_backward_12 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0744660Z triton_flex_attention_backward_13 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0745808Z triton_flex_attention_backward_10 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0746949Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0748137Z triton_flex_attention_backward_15 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0748393Z SingleProcess AUTOTUNE benchmarking takes 0.8717 seconds and 2.2315 seconds precompiling for 13 choices 2025-12-04T09:14:59.0748595Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.0748667Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.0748733Z unimplemented [] 2025-12-04T09:14:59.0748838Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.0749067Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.0750246Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.0750309Z graph_break [] 2025-12-04T09:14:59.0750445Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.0750511Z Autotune Choices Stats: 2025-12-04T09:14:59.0751907Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.0752152Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0752369Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0752695Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0753820Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0754935Z triton_flex_attention_24 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0756090Z triton_flex_attention_19 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0757199Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.0758443Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.0759563Z triton_flex_attention_20 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0759816Z SingleProcess AUTOTUNE benchmarking takes 0.2898 seconds and 1.2778 seconds precompiling for 6 choices 2025-12-04T09:14:59.0759883Z Autotune Choices Stats: 2025-12-04T09:14:59.0761312Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.0761746Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0762233Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0762796Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0763977Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0765189Z triton_flex_attention_backward_27 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0766488Z triton_flex_attention_backward_28 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0767639Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0768807Z triton_flex_attention_backward_30 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0769960Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0771105Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0772255Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0773440Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0774629Z triton_flex_attention_backward_37 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.0774938Z SingleProcess AUTOTUNE benchmarking takes 0.6605 seconds and 2.3199 seconds precompiling for 13 choices 2025-12-04T09:14:59.0775105Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.0775178Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.0775246Z unimplemented [] 2025-12-04T09:14:59.0775361Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.0775548Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.0776722Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.0776786Z graph_break [] 2025-12-04T09:14:59.0776922Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.0776988Z Autotune Choices Stats: 2025-12-04T09:14:59.0778364Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.0778615Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0778836Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0779162Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0780291Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0781410Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0782556Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.0783784Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.0784894Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0786004Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0786256Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2616 seconds precompiling for 6 choices 2025-12-04T09:14:59.0786323Z Autotune Choices Stats: 2025-12-04T09:14:59.0787753Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.0788186Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0788517Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0789078Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0790284Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0791434Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0792697Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0793844Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0795003Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0796162Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0797306Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0798456Z triton_flex_attention_backward_50 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0799640Z triton_flex_attention_backward_53 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0800894Z triton_flex_attention_backward_56 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.0801139Z SingleProcess AUTOTUNE benchmarking takes 0.6631 seconds and 2.2635 seconds precompiling for 13 choices 2025-12-04T09:14:59.0801273Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.0801344Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.0801410Z unimplemented [] 2025-12-04T09:14:59.0801515Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.0801699Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.0802879Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.0802939Z graph_break [] 2025-12-04T09:14:59.0803072Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.0803139Z Autotune Choices Stats: 2025-12-04T09:14:59.0804514Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.0804760Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0804978Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0805298Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0806425Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0807583Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0808783Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.0809906Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.0811021Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0812129Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0812434Z SingleProcess AUTOTUNE benchmarking takes 0.2907 seconds and 1.2500 seconds precompiling for 6 choices 2025-12-04T09:14:59.0812501Z Autotune Choices Stats: 2025-12-04T09:14:59.0813929Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.0814357Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0814686Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0815247Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0816457Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0817702Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0818857Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0820009Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0821160Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0822318Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0823462Z triton_flex_attention_backward_69 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0824685Z triton_flex_attention_backward_70 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0825866Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0827078Z triton_flex_attention_backward_75 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.0827321Z SingleProcess AUTOTUNE benchmarking takes 0.6601 seconds and 2.1469 seconds precompiling for 13 choices 2025-12-04T09:14:59.0827456Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.0827525Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.0827586Z unimplemented [] 2025-12-04T09:14:59.0827696Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.0827880Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.0829052Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.0829115Z graph_break [] 2025-12-04T09:14:59.0829250Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.0829324Z Autotune Choices Stats: 2025-12-04T09:14:59.0830706Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.0830951Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0831169Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0831489Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0832650Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0833778Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0835224Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.0836353Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.0837471Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0838582Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0838831Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2576 seconds precompiling for 6 choices 2025-12-04T09:14:59.0838900Z Autotune Choices Stats: 2025-12-04T09:14:59.0840321Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_84", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.0840754Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0841117Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0841676Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0842935Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0844083Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0845232Z triton_flex_attention_backward_83 0.0122 ms 92.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0846381Z triton_flex_attention_backward_82 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0847529Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0848677Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0849873Z triton_flex_attention_backward_88 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0851037Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0852333Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0853489Z triton_flex_attention_backward_91 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0853736Z SingleProcess AUTOTUNE benchmarking takes 0.6599 seconds and 2.2799 seconds precompiling for 13 choices 2025-12-04T09:14:59.0853869Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.0853937Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.0854001Z unimplemented [] 2025-12-04T09:14:59.0854112Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.0854296Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.0855473Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.0855536Z graph_break [] 2025-12-04T09:14:59.0855667Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.0855736Z Autotune Choices Stats: 2025-12-04T09:14:59.0857121Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:59.0857372Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0857592Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0857910Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0859065Z triton_flex_attention_99 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0860283Z triton_flex_attention_100 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0861399Z triton_flex_attention_95 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0862665Z triton_flex_attention_98 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.0863786Z triton_flex_attention_97 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.0864903Z triton_flex_attention_96 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0865160Z SingleProcess AUTOTUNE benchmarking takes 0.2905 seconds and 1.2994 seconds precompiling for 6 choices 2025-12-04T09:14:59.0865229Z Autotune Choices Stats: 2025-12-04T09:14:59.0866657Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_102", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.0867152Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0867478Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0868184Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0869420Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0870585Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0871745Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0872909Z triton_flex_attention_backward_101 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0874065Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0875218Z triton_flex_attention_backward_107 0.0133 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0876409Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0877654Z triton_flex_attention_backward_108 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0878803Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0879964Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0880211Z SingleProcess AUTOTUNE benchmarking takes 0.6615 seconds and 2.2434 seconds precompiling for 13 choices 2025-12-04T09:14:59.0880348Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.0880420Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.0880482Z unimplemented [] 2025-12-04T09:14:59.0880591Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.0880774Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.0881956Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.0882018Z graph_break [] 2025-12-04T09:14:59.0882154Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.0882232Z Autotune Choices Stats: 2025-12-04T09:14:59.0883629Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.0883913Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0884131Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0884451Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0885693Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0886813Z triton_flex_attention_119 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0887937Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.0889056Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.0890183Z triton_flex_attention_114 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0891296Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0891551Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2587 seconds precompiling for 6 choices 2025-12-04T09:14:59.0891618Z Autotune Choices Stats: 2025-12-04T09:14:59.0893129Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:59.0893630Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0893986Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0894551Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0895716Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0896888Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0898044Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0899201Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0900360Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0901539Z triton_flex_attention_backward_124 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0902794Z triton_flex_attention_backward_126 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0903955Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0905104Z triton_flex_attention_backward_129 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0906264Z triton_flex_attention_backward_132 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.0906515Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.3330 seconds precompiling for 13 choices 2025-12-04T09:14:59.0906652Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.0906721Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.0906785Z unimplemented [] 2025-12-04T09:14:59.0906896Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.0907080Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.0908261Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.0908325Z graph_break [] 2025-12-04T09:14:59.0908458Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.0908524Z Autotune Choices Stats: 2025-12-04T09:14:59.0909948Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:59.0910260Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0910479Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0910844Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0911969Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0913090Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0914209Z triton_flex_attention_136 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.0915338Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.0916455Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0917575Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0917820Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3488 seconds precompiling for 6 choices 2025-12-04T09:14:59.0917886Z Autotune Choices Stats: 2025-12-04T09:14:59.0919353Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_140", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.0919892Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0920226Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0920791Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0921966Z triton_flex_attention_backward_140 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0923127Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0924291Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0925444Z triton_flex_attention_backward_142 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0926639Z triton_flex_attention_backward_144 0.0131 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0927783Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0929047Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0930197Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0931352Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0932586Z triton_flex_attention_backward_149 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0932839Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.2392 seconds precompiling for 13 choices 2025-12-04T09:14:59.0932978Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.0933050Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.0933112Z unimplemented [] 2025-12-04T09:14:59.0933224Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.0933412Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.0934594Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.0934656Z graph_break [] 2025-12-04T09:14:59.0934784Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.0934898Z Autotune Choices Stats: 2025-12-04T09:14:59.0936288Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.0936632Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0936851Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0937177Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0938302Z triton_flex_attention_156 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0939433Z triton_flex_attention_157 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0940543Z triton_flex_attention_152 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0941670Z triton_flex_attention_154 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.0942792Z triton_flex_attention_155 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.0943937Z triton_flex_attention_153 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0944187Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3309 seconds precompiling for 6 choices 2025-12-04T09:14:59.0944317Z Autotune Choices Stats: 2025-12-04T09:14:59.0945776Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_160", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.0946217Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0946547Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0947112Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0948275Z triton_flex_attention_backward_160 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0949437Z triton_flex_attention_backward_158 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0950596Z triton_flex_attention_backward_159 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0951748Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0952974Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0954219Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0955378Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0956544Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0957691Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0958849Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.0959102Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.1546 seconds precompiling for 13 choices 2025-12-04T09:14:59.0959250Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.0959321Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.0959383Z unimplemented [] 2025-12-04T09:14:59.0959498Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.0959683Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.0960911Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.0960975Z graph_break [] 2025-12-04T09:14:59.0961168Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.0961244Z Autotune Choices Stats: 2025-12-04T09:14:59.0962895Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.0963159Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0963388Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0963717Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0964861Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0965980Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0967099Z triton_flex_attention_171 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0968223Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.0969404Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.0970528Z triton_flex_attention_172 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0970896Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3199 seconds precompiling for 6 choices 2025-12-04T09:14:59.0970966Z Autotune Choices Stats: 2025-12-04T09:14:59.0972433Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_178", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.0972882Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0973210Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0973777Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0974948Z triton_flex_attention_backward_178 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0976108Z triton_flex_attention_backward_177 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0977267Z triton_flex_attention_backward_180 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0978450Z triton_flex_attention_backward_179 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0979734Z triton_flex_attention_backward_182 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0980880Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.0982033Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.0983187Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.0984338Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0985493Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.0985741Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3449 seconds precompiling for 13 choices 2025-12-04T09:14:59.0985879Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.0985948Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.0986011Z unimplemented [] 2025-12-04T09:14:59.0986119Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.0986339Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.0987523Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.0987647Z graph_break [] 2025-12-04T09:14:59.0987807Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.0987879Z Autotune Choices Stats: 2025-12-04T09:14:59.0989265Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:59.0989515Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0989731Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.0990067Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.0991200Z triton_flex_attention_194 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0992322Z triton_flex_attention_195 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0993438Z triton_flex_attention_192 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.0994559Z triton_flex_attention_193 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.0995707Z triton_flex_attention_190 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0996925Z triton_flex_attention_191 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.0997177Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3094 seconds precompiling for 6 choices 2025-12-04T09:14:59.0997244Z Autotune Choices Stats: 2025-12-04T09:14:59.0998661Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.0999103Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.0999436Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1000004Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1001174Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1002332Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1003524Z triton_flex_attention_backward_197 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1004674Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1005967Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1007117Z triton_flex_attention_backward_200 0.0124 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1008271Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1009432Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1010589Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.1011748Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1012027Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.3930 seconds precompiling for 13 choices 2025-12-04T09:14:59.1012163Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.1012232Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.1012333Z unimplemented [] 2025-12-04T09:14:59.1012525Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.1012708Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.1013920Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.1013983Z graph_break [] 2025-12-04T09:14:59.1014109Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.1014181Z Autotune Choices Stats: 2025-12-04T09:14:59.1015567Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008063999935984612, "best_triton_pos": 0} 2025-12-04T09:14:59.1015820Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1016038Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1016358Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1017488Z triton_flex_attention_213 0.0081 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1018602Z triton_flex_attention_214 0.0082 ms 98.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1019720Z triton_flex_attention_211 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.1020876Z triton_flex_attention_212 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.1022074Z triton_flex_attention_209 0.0112 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1023195Z triton_flex_attention_210 0.0133 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1023443Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.3148 seconds precompiling for 6 choices 2025-12-04T09:14:59.1023518Z Autotune Choices Stats: 2025-12-04T09:14:59.1024946Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.1025387Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1025713Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1026279Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1027451Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1028618Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1029812Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1031056Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1032214Z triton_flex_attention_backward_219 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1033377Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1034540Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1035703Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1036854Z triton_flex_attention_backward_224 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1038037Z triton_flex_attention_backward_227 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.1038342Z SingleProcess AUTOTUNE benchmarking takes 0.6619 seconds and 2.3746 seconds precompiling for 13 choices 2025-12-04T09:14:59.1038478Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.1038547Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.1038610Z unimplemented [] 2025-12-04T09:14:59.1038720Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.1038940Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.1040130Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.1040197Z graph_break [] 2025-12-04T09:14:59.1040324Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.1040397Z Autotune Choices Stats: 2025-12-04T09:14:59.1041781Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.1042031Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1042252Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1042574Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1043699Z triton_flex_attention_232 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1044821Z triton_flex_attention_233 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1045979Z triton_flex_attention_228 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1047100Z triton_flex_attention_230 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.1048320Z triton_flex_attention_231 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.1049441Z triton_flex_attention_229 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1049687Z SingleProcess AUTOTUNE benchmarking takes 0.2900 seconds and 1.2906 seconds precompiling for 6 choices 2025-12-04T09:14:59.1049759Z Autotune Choices Stats: 2025-12-04T09:14:59.1051182Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:59.1051620Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1051943Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1052599Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1053774Z triton_flex_attention_backward_234 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1054971Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1056228Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1057381Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1058539Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1059686Z triton_flex_attention_backward_240 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1060846Z triton_flex_attention_backward_241 0.0134 ms 91.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1062158Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1063393Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1064568Z triton_flex_attention_backward_246 0.0153 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.1064946Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2768 seconds precompiling for 13 choices 2025-12-04T09:14:59.1065084Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.1065155Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.1065217Z unimplemented [] 2025-12-04T09:14:59.1065328Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.1065515Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.1066690Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.1066756Z graph_break [] 2025-12-04T09:14:59.1066884Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.1066961Z Autotune Choices Stats: 2025-12-04T09:14:59.1068348Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.1068600Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1068822Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1069144Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1070273Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1071442Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1072561Z triton_flex_attention_250 0.0111 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.1073771Z triton_flex_attention_247 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1074886Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.1076013Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1076254Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2952 seconds precompiling for 6 choices 2025-12-04T09:14:59.1076330Z Autotune Choices Stats: 2025-12-04T09:14:59.1077751Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_255", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.1078186Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1078509Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1079077Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1080300Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1081556Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1082708Z triton_flex_attention_backward_254 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1083870Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1085026Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1086181Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1087356Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1088514Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1089703Z triton_flex_attention_backward_262 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1090949Z triton_flex_attention_backward_265 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.1091197Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.3139 seconds precompiling for 13 choices 2025-12-04T09:14:59.1091334Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.1091407Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.1091469Z unimplemented [] 2025-12-04T09:14:59.1091583Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.1091769Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.1092987Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.1093060Z graph_break [] 2025-12-04T09:14:59.1093189Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.1093268Z Autotune Choices Stats: 2025-12-04T09:14:59.1094650Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.1094901Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1095118Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1095444Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1096575Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1097738Z triton_flex_attention_271 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1098981Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1100121Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.1101237Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.1102364Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1102609Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T09:14:59.1102683Z Autotune Choices Stats: 2025-12-04T09:14:59.1104100Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:59.1104537Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1104861Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1105458Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1106629Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1107926Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1109079Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1110243Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1111400Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1112552Z triton_flex_attention_backward_278 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1113709Z triton_flex_attention_backward_279 0.0136 ms 90.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1114891Z triton_flex_attention_backward_276 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1116147Z triton_flex_attention_backward_281 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1117301Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.1117544Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.2940 seconds precompiling for 13 choices 2025-12-04T09:14:59.1117672Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.1117749Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.1117810Z unimplemented [] 2025-12-04T09:14:59.1117915Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.1118098Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.1119269Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.1119339Z graph_break [] 2025-12-04T09:14:59.1119464Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.1119536Z Autotune Choices Stats: 2025-12-04T09:14:59.1120931Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.1121182Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1121398Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1121726Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1122898Z triton_flex_attention_289 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1124116Z triton_flex_attention_290 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1125237Z triton_flex_attention_287 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.1126367Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1127479Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.1128603Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1128846Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3003 seconds precompiling for 6 choices 2025-12-04T09:14:59.1128920Z Autotune Choices Stats: 2025-12-04T09:14:59.1130339Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_294", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.012160000391304493, "best_triton_pos": 0} 2025-12-04T09:14:59.1130797Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1131158Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1131722Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1133038Z triton_flex_attention_backward_294 0.0122 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1134205Z triton_flex_attention_backward_291 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1135361Z triton_flex_attention_backward_292 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1136522Z triton_flex_attention_backward_293 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1137683Z triton_flex_attention_backward_296 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1138836Z triton_flex_attention_backward_295 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1140071Z triton_flex_attention_backward_297 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1141247Z triton_flex_attention_backward_298 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1142475Z triton_flex_attention_backward_300 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1143633Z triton_flex_attention_backward_303 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.1143889Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.2523 seconds precompiling for 13 choices 2025-12-04T09:14:59.1144022Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.1144099Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.1144163Z unimplemented [] 2025-12-04T09:14:59.1144268Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.1144454Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.1145630Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.1145699Z graph_break [] 2025-12-04T09:14:59.1145828Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.1145902Z Autotune Choices Stats: 2025-12-04T09:14:59.1147283Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_308", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.1147533Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1153914Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1154369Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1155552Z triton_flex_attention_308 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1156834Z triton_flex_attention_309 0.0082 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1157970Z triton_flex_attention_307 0.0112 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.1159103Z triton_flex_attention_304 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1160212Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.1161339Z triton_flex_attention_305 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1161596Z SingleProcess AUTOTUNE benchmarking takes 0.2924 seconds and 1.4096 seconds precompiling for 6 choices 2025-12-04T09:14:59.1161676Z Autotune Choices Stats: 2025-12-04T09:14:59.1163378Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.1163841Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1164270Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1164896Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1166078Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1167255Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1168406Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1169574Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1170742Z triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1171955Z triton_flex_attention_backward_316 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1173182Z triton_flex_attention_backward_317 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1174435Z triton_flex_attention_backward_314 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1175591Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1176746Z triton_flex_attention_backward_318 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1177006Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.3225 seconds precompiling for 13 choices 2025-12-04T09:14:59.1177153Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.1177236Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.1177300Z unimplemented [] 2025-12-04T09:14:59.1177407Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.1177606Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.1178795Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.1178866Z graph_break [] 2025-12-04T09:14:59.1178998Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.1179069Z Autotune Choices Stats: 2025-12-04T09:14:59.1180505Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.007968000136315823, "best_triton_pos": 0} 2025-12-04T09:14:59.1180760Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1180981Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1181361Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1182535Z triton_flex_attention_327 0.0080 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1183642Z triton_flex_attention_328 0.0082 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1184764Z triton_flex_attention_325 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.1185876Z triton_flex_attention_326 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.1186992Z triton_flex_attention_323 0.0113 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1188108Z triton_flex_attention_324 0.0133 ms 59.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1188357Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T09:14:59.1188430Z Autotune Choices Stats: 2025-12-04T09:14:59.1189900Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.1190398Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1190755Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1191333Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1192507Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1193668Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1194823Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1195983Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1197140Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1198326Z triton_flex_attention_backward_336 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1199594Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1200748Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1201902Z triton_flex_attention_backward_338 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1203056Z triton_flex_attention_backward_341 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.1203305Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3668 seconds precompiling for 13 choices 2025-12-04T09:14:59.1203435Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.1203512Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.1203577Z unimplemented [] 2025-12-04T09:14:59.1203683Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.1203877Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.1205049Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.1205121Z graph_break [] 2025-12-04T09:14:59.1205252Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.1205336Z Autotune Choices Stats: 2025-12-04T09:14:59.1206769Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.1207083Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1207333Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1207654Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1208793Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1209912Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1211031Z triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1212158Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.1213318Z triton_flex_attention_345 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.1214443Z triton_flex_attention_343 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1214725Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2820 seconds precompiling for 6 choices 2025-12-04T09:14:59.1214807Z Autotune Choices Stats: 2025-12-04T09:14:59.1216269Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_349", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.1216770Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1217095Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1217667Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1218835Z triton_flex_attention_backward_349 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1220006Z triton_flex_attention_backward_348 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1221161Z triton_flex_attention_backward_350 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1222313Z triton_flex_attention_backward_351 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1223518Z triton_flex_attention_backward_353 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1224758Z triton_flex_attention_backward_355 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1225915Z triton_flex_attention_backward_352 0.0143 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1227066Z triton_flex_attention_backward_354 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1228241Z triton_flex_attention_backward_357 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1229402Z triton_flex_attention_backward_360 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.1229659Z SingleProcess AUTOTUNE benchmarking takes 0.6637 seconds and 2.2959 seconds precompiling for 13 choices 2025-12-04T09:14:59.1229790Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.1229869Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.1229937Z unimplemented [] 2025-12-04T09:14:59.1230042Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.1230233Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.1231405Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.1231513Z graph_break [] 2025-12-04T09:14:59.1231644Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.1231711Z Autotune Choices Stats: 2025-12-04T09:14:59.1233140Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.1233521Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1233742Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1234057Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1235193Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1236314Z triton_flex_attention_366 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1237444Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.1238562Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1239684Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.1240839Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1241141Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3270 seconds precompiling for 6 choices 2025-12-04T09:14:59.1241212Z Autotune Choices Stats: 2025-12-04T09:14:59.1242663Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_368", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.1243104Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1243429Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1243997Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1245163Z triton_flex_attention_backward_368 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1246321Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1247467Z triton_flex_attention_backward_369 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1248655Z triton_flex_attention_backward_370 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1249799Z triton_flex_attention_backward_372 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1251044Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1252201Z triton_flex_attention_backward_374 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1253388Z triton_flex_attention_backward_371 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1254539Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1255691Z triton_flex_attention_backward_379 0.0146 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.1255941Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.4637 seconds precompiling for 13 choices 2025-12-04T09:14:59.1256071Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.1256144Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.1256208Z unimplemented [] 2025-12-04T09:14:59.1256311Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.1256502Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.1257714Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.1257841Z graph_break [] 2025-12-04T09:14:59.1257969Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.1258036Z Autotune Choices Stats: 2025-12-04T09:14:59.1259456Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_384", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.1259698Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1259925Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1260243Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1261374Z triton_flex_attention_384 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1262602Z triton_flex_attention_385 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1263737Z triton_flex_attention_380 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1264859Z triton_flex_attention_382 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.1266033Z triton_flex_attention_383 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.1267303Z triton_flex_attention_381 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1267549Z SingleProcess AUTOTUNE benchmarking takes 0.2918 seconds and 1.3242 seconds precompiling for 6 choices 2025-12-04T09:14:59.1267622Z Autotune Choices Stats: 2025-12-04T09:14:59.1269053Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_386", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:59.1269491Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1269814Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1270381Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1271545Z triton_flex_attention_backward_386 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1272705Z triton_flex_attention_backward_387 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1273854Z triton_flex_attention_backward_388 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1275048Z triton_flex_attention_backward_389 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1276288Z triton_flex_attention_backward_391 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1277447Z triton_flex_attention_backward_392 0.0134 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1278607Z triton_flex_attention_backward_390 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1279753Z triton_flex_attention_backward_393 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1280909Z triton_flex_attention_backward_395 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1282057Z triton_flex_attention_backward_394 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1282317Z SingleProcess AUTOTUNE benchmarking takes 0.6645 seconds and 2.2524 seconds precompiling for 13 choices 2025-12-04T09:14:59.1282446Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.1282523Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.1282621Z unimplemented [] 2025-12-04T09:14:59.1282726Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.1282917Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.1284454Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.1284532Z graph_break [] 2025-12-04T09:14:59.1284662Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.1284730Z Autotune Choices Stats: 2025-12-04T09:14:59.1286128Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_403", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.1286377Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1286604Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1286925Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1288060Z triton_flex_attention_403 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1289174Z triton_flex_attention_404 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1290293Z triton_flex_attention_399 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1291457Z triton_flex_attention_401 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.1292667Z triton_flex_attention_402 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.1294325Z triton_flex_attention_400 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1294663Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3374 seconds precompiling for 6 choices 2025-12-04T09:14:59.1294760Z Autotune Choices Stats: 2025-12-04T09:14:59.1296694Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_406", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.1297296Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1297737Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1298514Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1300048Z triton_flex_attention_backward_406 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1301211Z triton_flex_attention_backward_405 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1302413Z triton_flex_attention_backward_407 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1303663Z triton_flex_attention_backward_408 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1304808Z triton_flex_attention_backward_410 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1305964Z triton_flex_attention_backward_409 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1307116Z triton_flex_attention_backward_411 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1308270Z triton_flex_attention_backward_412 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1309440Z triton_flex_attention_backward_413 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1310630Z triton_flex_attention_backward_414 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1310885Z SingleProcess AUTOTUNE benchmarking takes 0.6644 seconds and 2.4031 seconds precompiling for 13 choices 2025-12-04T09:14:59.1311015Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.1311156Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.1311219Z unimplemented [] 2025-12-04T09:14:59.1311320Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.1311508Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.1312714Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.1312782Z graph_break [] 2025-12-04T09:14:59.1312910Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.1312978Z Autotune Choices Stats: 2025-12-04T09:14:59.1314371Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_422", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.1314615Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1314836Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1315153Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1316279Z triton_flex_attention_422 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1317394Z triton_flex_attention_423 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1318512Z triton_flex_attention_420 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.1319659Z triton_flex_attention_421 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.1320878Z triton_flex_attention_418 0.0123 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1322003Z triton_flex_attention_419 0.0142 ms 57.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1322247Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3073 seconds precompiling for 6 choices 2025-12-04T09:14:59.1322319Z Autotune Choices Stats: 2025-12-04T09:14:59.1323745Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_426", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012256000190973282, "best_triton_pos": 0} 2025-12-04T09:14:59.1324183Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1324508Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1325071Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1326262Z triton_flex_attention_backward_426 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1327467Z triton_flex_attention_backward_424 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1328652Z triton_flex_attention_backward_425 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1329899Z triton_flex_attention_backward_427 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1331056Z triton_flex_attention_backward_429 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1332220Z triton_flex_attention_backward_431 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1333432Z triton_flex_attention_backward_430 0.0141 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1334584Z triton_flex_attention_backward_428 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1335739Z triton_flex_attention_backward_433 0.0143 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1336928Z triton_flex_attention_backward_436 0.0144 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.1337249Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.3392 seconds precompiling for 13 choices 2025-12-04T09:14:59.1337381Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.1337539Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.1337605Z unimplemented [] 2025-12-04T09:14:59.1337712Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.1337910Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.1339096Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.1339169Z graph_break [] 2025-12-04T09:14:59.1339298Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.1339371Z Autotune Choices Stats: 2025-12-04T09:14:59.1340783Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_441", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.1341032Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1341263Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1341584Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1342721Z triton_flex_attention_441 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1343835Z triton_flex_attention_442 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1344990Z triton_flex_attention_439 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.1346194Z triton_flex_attention_437 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1347322Z triton_flex_attention_440 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.1348439Z triton_flex_attention_438 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1348684Z SingleProcess AUTOTUNE benchmarking takes 0.2906 seconds and 1.4377 seconds precompiling for 6 choices 2025-12-04T09:14:59.1348752Z Autotune Choices Stats: 2025-12-04T09:14:59.1350191Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_445", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.1350639Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1350967Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1351534Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1352738Z triton_flex_attention_backward_445 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1353900Z triton_flex_attention_backward_446 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1355148Z triton_flex_attention_backward_443 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1356308Z triton_flex_attention_backward_444 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1357454Z triton_flex_attention_backward_447 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1358615Z triton_flex_attention_backward_448 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1359776Z triton_flex_attention_backward_450 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1360927Z triton_flex_attention_backward_449 0.0135 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1362283Z triton_flex_attention_backward_452 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1363597Z triton_flex_attention_backward_455 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.1363866Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.3278 seconds precompiling for 13 choices 2025-12-04T09:14:59.1363998Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.1364074Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.1364137Z unimplemented [] 2025-12-04T09:14:59.1364239Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.1364430Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.1365605Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.1365672Z graph_break [] 2025-12-04T09:14:59.1365800Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.1365869Z Autotune Choices Stats: 2025-12-04T09:14:59.1367276Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_460", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:59.1367526Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1367750Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1368066Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1369212Z triton_flex_attention_460 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1370359Z triton_flex_attention_461 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1371635Z triton_flex_attention_459 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.1372784Z triton_flex_attention_456 0.0114 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1373909Z triton_flex_attention_458 0.0123 ms 75.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.1375029Z triton_flex_attention_457 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1375277Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.4513 seconds precompiling for 6 choices 2025-12-04T09:14:59.1375343Z Autotune Choices Stats: 2025-12-04T09:14:59.1376790Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_463", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.1377228Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1377561Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1378120Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1379342Z triton_flex_attention_backward_463 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1380598Z triton_flex_attention_backward_465 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1383883Z triton_flex_attention_backward_462 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1386608Z triton_flex_attention_backward_464 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1387862Z triton_flex_attention_backward_467 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1389040Z triton_flex_attention_backward_468 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1390192Z triton_flex_attention_backward_469 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1391342Z triton_flex_attention_backward_466 0.0133 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1392490Z triton_flex_attention_backward_471 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1393730Z triton_flex_attention_backward_470 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1393989Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.2662 seconds precompiling for 13 choices 2025-12-04T09:14:59.1394130Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.1394280Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.1394352Z unimplemented [] 2025-12-04T09:14:59.1394459Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.1394655Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.1395898Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.1395971Z graph_break [] 2025-12-04T09:14:59.1396107Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.1396174Z Autotune Choices Stats: 2025-12-04T09:14:59.1397577Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_479", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.007968000136315823, "best_triton_pos": 0} 2025-12-04T09:14:59.1397825Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1398050Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1398366Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1399500Z triton_flex_attention_479 0.0080 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1400617Z triton_flex_attention_480 0.0082 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1401810Z triton_flex_attention_477 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.1402965Z triton_flex_attention_478 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.1404122Z triton_flex_attention_475 0.0113 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1405241Z triton_flex_attention_476 0.0133 ms 59.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1405489Z SingleProcess AUTOTUNE benchmarking takes 0.2917 seconds and 1.3266 seconds precompiling for 6 choices 2025-12-04T09:14:59.1405557Z Autotune Choices Stats: 2025-12-04T09:14:59.1406999Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_481", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.1407434Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1407762Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1408335Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1409570Z triton_flex_attention_backward_481 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1410743Z triton_flex_attention_backward_482 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1411929Z triton_flex_attention_backward_483 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1413200Z triton_flex_attention_backward_484 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1414361Z triton_flex_attention_backward_485 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1415503Z triton_flex_attention_backward_486 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1416654Z triton_flex_attention_backward_487 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1417796Z triton_flex_attention_backward_488 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1419008Z triton_flex_attention_backward_490 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1420194Z triton_flex_attention_backward_493 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.1420446Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2475 seconds precompiling for 13 choices 2025-12-04T09:14:59.1420580Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.1420660Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.1420724Z unimplemented [] 2025-12-04T09:14:59.1420868Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.1421068Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.1422248Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.1422319Z graph_break [] 2025-12-04T09:14:59.1422446Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.1422524Z Autotune Choices Stats: 2025-12-04T09:14:59.1423908Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_498", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.1424164Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1424389Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1424703Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1425843Z triton_flex_attention_498 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1427018Z triton_flex_attention_499 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1428184Z triton_flex_attention_494 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1429347Z triton_flex_attention_496 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.1430452Z triton_flex_attention_497 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.1431564Z triton_flex_attention_495 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1431805Z SingleProcess AUTOTUNE benchmarking takes 0.2916 seconds and 1.3152 seconds precompiling for 6 choices 2025-12-04T09:14:59.1431875Z Autotune Choices Stats: 2025-12-04T09:14:59.1433296Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_500", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:59.1433733Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1434054Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1434681Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1435846Z triton_flex_attention_backward_500 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1437060Z triton_flex_attention_backward_501 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1438253Z triton_flex_attention_backward_502 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1439412Z triton_flex_attention_backward_503 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1440566Z triton_flex_attention_backward_505 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1441710Z triton_flex_attention_backward_507 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1442858Z triton_flex_attention_backward_506 0.0143 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1444064Z triton_flex_attention_backward_504 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1445214Z triton_flex_attention_backward_509 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1446403Z triton_flex_attention_backward_512 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.1446682Z SingleProcess AUTOTUNE benchmarking takes 0.6641 seconds and 2.2733 seconds precompiling for 13 choices 2025-12-04T09:14:59.1446812Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.1446886Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.1446951Z unimplemented [] 2025-12-04T09:14:59.1447051Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.1447240Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.1448410Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 27), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 8), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 5), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.1448478Z graph_break [] 2025-12-04T09:14:59.1448617Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.1448692Z Autotune Choices Stats: 2025-12-04T09:14:59.1450089Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_517", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.1450334Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1450554Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1450864Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1452067Z triton_flex_attention_517 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1453218Z triton_flex_attention_518 0.0084 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1454371Z triton_flex_attention_513 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1455538Z triton_flex_attention_515 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.1456651Z triton_flex_attention_516 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.1457768Z triton_flex_attention_514 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1458011Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3311 seconds precompiling for 6 choices 2025-12-04T09:14:59.1458084Z Autotune Choices Stats: 2025-12-04T09:14:59.1459509Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_519", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:59.1460029Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1460355Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1460917Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1462371Z triton_flex_attention_backward_519 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1463642Z triton_flex_attention_backward_520 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1464809Z triton_flex_attention_backward_521 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1465984Z triton_flex_attention_backward_522 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1467144Z triton_flex_attention_backward_524 0.0132 ms 93.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1468289Z triton_flex_attention_backward_523 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1469447Z triton_flex_attention_backward_525 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1470690Z triton_flex_attention_backward_526 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1471877Z triton_flex_attention_backward_528 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1473058Z triton_flex_attention_backward_531 0.0153 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.1473317Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.2909 seconds precompiling for 13 choices 2025-12-04T09:14:59.1473451Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.1473526Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.1473591Z unimplemented [] 2025-12-04T09:14:59.1473693Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.1473884Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.1475062Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.1475138Z graph_break [] 2025-12-04T09:14:59.1475276Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.1475349Z Autotune Choices Stats: 2025-12-04T09:14:59.1476735Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_536", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.1477058Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1477281Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1477597Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1478725Z triton_flex_attention_536 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1479882Z triton_flex_attention_537 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1481039Z triton_flex_attention_532 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1482174Z triton_flex_attention_534 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.1483286Z triton_flex_attention_535 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.1484406Z triton_flex_attention_533 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1484648Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.2928 seconds precompiling for 6 choices 2025-12-04T09:14:59.1484721Z Autotune Choices Stats: 2025-12-04T09:14:59.1486148Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_538", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:14:59.1486653Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1486980Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1487546Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1488744Z triton_flex_attention_backward_538 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1489931Z triton_flex_attention_backward_539 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1491081Z triton_flex_attention_backward_540 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1492231Z triton_flex_attention_backward_541 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1493444Z triton_flex_attention_backward_542 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1494599Z triton_flex_attention_backward_543 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1495840Z triton_flex_attention_backward_545 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1497020Z triton_flex_attention_backward_544 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1498206Z triton_flex_attention_backward_547 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1499347Z triton_flex_attention_backward_550 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.1499614Z SingleProcess AUTOTUNE benchmarking takes 0.6681 seconds and 2.3456 seconds precompiling for 13 choices 2025-12-04T09:14:59.1499741Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.1499821Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.1499881Z unimplemented [] 2025-12-04T09:14:59.1499982Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.1500169Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.1501345Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.1501412Z graph_break [] 2025-12-04T09:14:59.1501536Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.1501601Z Autotune Choices Stats: 2025-12-04T09:14:59.1502989Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_555", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:59.1503301Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1503522Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1503835Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1504991Z triton_flex_attention_555 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1506196Z triton_flex_attention_556 0.0112 ms 91.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1507317Z triton_flex_attention_553 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.1508438Z triton_flex_attention_554 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.1509546Z triton_flex_attention_551 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1510664Z triton_flex_attention_552 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1510909Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.3230 seconds precompiling for 6 choices 2025-12-04T09:14:59.1511049Z Autotune Choices Stats: 2025-12-04T09:14:59.1512471Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_559", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:59.1512909Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1513271Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1513832Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1515029Z triton_flex_attention_backward_559 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1516186Z triton_flex_attention_backward_560 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1517329Z triton_flex_attention_backward_558 0.0142 ms 93.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1518485Z triton_flex_attention_backward_557 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1519636Z triton_flex_attention_backward_562 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1520850Z triton_flex_attention_backward_561 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1522031Z triton_flex_attention_backward_563 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1523231Z triton_flex_attention_backward_564 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1524381Z triton_flex_attention_backward_569 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.1525552Z triton_flex_attention_backward_566 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1525802Z SingleProcess AUTOTUNE benchmarking takes 0.6659 seconds and 2.2862 seconds precompiling for 13 choices 2025-12-04T09:14:59.1525932Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.1526008Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.1526070Z unimplemented [] 2025-12-04T09:14:59.1526172Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.1526367Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.1527540Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.1527610Z graph_break [] 2025-12-04T09:14:59.1527735Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.1527868Z Autotune Choices Stats: 2025-12-04T09:14:59.1529269Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_575", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:59.1529510Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1529732Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1530099Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1531243Z triton_flex_attention_575 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1532466Z triton_flex_attention_574 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1533629Z triton_flex_attention_572 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.1534756Z triton_flex_attention_573 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.1535868Z triton_flex_attention_570 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1536986Z triton_flex_attention_571 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1537298Z SingleProcess AUTOTUNE benchmarking takes 0.2930 seconds and 1.2848 seconds precompiling for 6 choices 2025-12-04T09:14:59.1537372Z Autotune Choices Stats: 2025-12-04T09:14:59.1538803Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_579", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.014271999709308147, "best_triton_pos": 0} 2025-12-04T09:14:59.1539287Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1539615Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1540211Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1541376Z triton_flex_attention_backward_579 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1542536Z triton_flex_attention_backward_577 0.0143 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1543690Z triton_flex_attention_backward_578 0.0143 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1544846Z triton_flex_attention_backward_576 0.0145 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1546058Z triton_flex_attention_backward_580 0.0154 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1547213Z triton_flex_attention_backward_581 0.0154 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1548397Z triton_flex_attention_backward_583 0.0155 ms 92.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1549578Z triton_flex_attention_backward_582 0.0164 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1550732Z triton_flex_attention_backward_585 0.0164 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1551885Z triton_flex_attention_backward_588 0.0164 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.1552137Z SingleProcess AUTOTUNE benchmarking takes 0.6683 seconds and 2.2637 seconds precompiling for 13 choices 2025-12-04T09:14:59.1552276Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.1552352Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.1552412Z unimplemented [] 2025-12-04T09:14:59.1552512Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.1552701Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.1553872Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.1554004Z graph_break [] 2025-12-04T09:14:59.1554130Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.1554199Z Autotune Choices Stats: 2025-12-04T09:14:59.1555590Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.1555892Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1556117Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1556432Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1557594Z triton_flex_attention_593 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1558710Z triton_flex_attention_594 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1559823Z triton_flex_attention_589 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1560938Z triton_flex_attention_591 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.1562251Z triton_flex_attention_592 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.1563506Z triton_flex_attention_590 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1563758Z SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.3283 seconds precompiling for 6 choices 2025-12-04T09:14:59.1563831Z Autotune Choices Stats: 2025-12-04T09:14:59.1565307Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_595", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:14:59.1565750Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1566120Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1566683Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1567862Z triton_flex_attention_backward_595 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1569018Z triton_flex_attention_backward_596 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1570173Z triton_flex_attention_backward_597 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1571344Z triton_flex_attention_backward_598 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1572613Z triton_flex_attention_backward_600 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1573804Z triton_flex_attention_backward_604 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1574988Z triton_flex_attention_backward_599 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1576136Z triton_flex_attention_backward_601 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1577290Z triton_flex_attention_backward_602 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1578438Z triton_flex_attention_backward_607 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.1578691Z SingleProcess AUTOTUNE benchmarking takes 0.6683 seconds and 2.3815 seconds precompiling for 13 choices 2025-12-04T09:14:59.1578823Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.1578899Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.1578963Z unimplemented [] 2025-12-04T09:14:59.1579062Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.1579323Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.1580495Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.1580570Z graph_break [] 2025-12-04T09:14:59.1580703Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.1580769Z Autotune Choices Stats: 2025-12-04T09:14:59.1582196Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_612", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.1582442Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1582695Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1583009Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1584144Z triton_flex_attention_612 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1585250Z triton_flex_attention_613 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1586371Z triton_flex_attention_610 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.1587478Z triton_flex_attention_608 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1588682Z triton_flex_attention_611 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.1589811Z triton_flex_attention_609 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1590053Z SingleProcess AUTOTUNE benchmarking takes 0.2948 seconds and 1.3336 seconds precompiling for 6 choices 2025-12-04T09:14:59.1590153Z Autotune Choices Stats: 2025-12-04T09:14:59.1591621Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_614", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:14:59.1592058Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1592388Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1592949Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1594115Z triton_flex_attention_backward_614 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1595276Z triton_flex_attention_backward_615 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1596425Z triton_flex_attention_backward_616 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1597662Z triton_flex_attention_backward_617 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1598843Z triton_flex_attention_backward_619 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1600040Z triton_flex_attention_backward_620 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1601199Z triton_flex_attention_backward_621 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1602346Z triton_flex_attention_backward_618 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1603497Z triton_flex_attention_backward_623 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1604649Z triton_flex_attention_backward_626 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.1604968Z SingleProcess AUTOTUNE benchmarking takes 0.6671 seconds and 2.3094 seconds precompiling for 13 choices 2025-12-04T09:14:59.1605098Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.1605179Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.1605247Z unimplemented [] 2025-12-04T09:14:59.1605353Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.1605546Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.1606719Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.1606791Z graph_break [] 2025-12-04T09:14:59.1606921Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.1607022Z Autotune Choices Stats: 2025-12-04T09:14:59.1608446Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_631", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:59.1608688Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1608912Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1609227Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1610355Z triton_flex_attention_631 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1611472Z triton_flex_attention_632 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1612643Z triton_flex_attention_627 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1613760Z triton_flex_attention_629 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.1614972Z triton_flex_attention_630 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.1616122Z triton_flex_attention_628 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1616369Z SingleProcess AUTOTUNE benchmarking takes 0.2939 seconds and 1.3035 seconds precompiling for 6 choices 2025-12-04T09:14:59.1616437Z Autotune Choices Stats: 2025-12-04T09:14:59.1617898Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_634", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:59.1618337Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1618660Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1619214Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1620383Z triton_flex_attention_backward_634 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1621537Z triton_flex_attention_backward_635 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1622769Z triton_flex_attention_backward_636 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1623925Z triton_flex_attention_backward_633 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1625101Z triton_flex_attention_backward_638 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1626285Z triton_flex_attention_backward_640 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1627437Z triton_flex_attention_backward_637 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1628577Z triton_flex_attention_backward_639 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1629730Z triton_flex_attention_backward_642 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1630874Z triton_flex_attention_backward_645 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.1631184Z SingleProcess AUTOTUNE benchmarking takes 0.6697 seconds and 2.2461 seconds precompiling for 13 choices 2025-12-04T09:14:59.1631311Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.1631385Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.1631448Z unimplemented [] 2025-12-04T09:14:59.1631549Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.1631750Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.1632960Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.1633029Z graph_break [] 2025-12-04T09:14:59.1633157Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.1633221Z Autotune Choices Stats: 2025-12-04T09:14:59.1634644Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_650", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:59.1634889Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1635111Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1635434Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1636573Z triton_flex_attention_650 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1637682Z triton_flex_attention_651 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1638805Z triton_flex_attention_648 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.1639989Z triton_flex_attention_646 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1641177Z triton_flex_attention_649 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.1642326Z triton_flex_attention_647 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1642569Z SingleProcess AUTOTUNE benchmarking takes 0.2938 seconds and 1.3188 seconds precompiling for 6 choices 2025-12-04T09:14:59.1642634Z Autotune Choices Stats: 2025-12-04T09:14:59.1644069Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_654", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:59.1644501Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1644844Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1645406Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1646572Z triton_flex_attention_backward_654 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1647726Z triton_flex_attention_backward_652 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1648942Z triton_flex_attention_backward_653 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1650129Z triton_flex_attention_backward_655 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1651322Z triton_flex_attention_backward_657 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1652538Z triton_flex_attention_backward_659 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1653715Z triton_flex_attention_backward_658 0.0156 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1654894Z triton_flex_attention_backward_656 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1656051Z triton_flex_attention_backward_661 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1657264Z triton_flex_attention_backward_664 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.1657525Z SingleProcess AUTOTUNE benchmarking takes 0.6665 seconds and 2.3325 seconds precompiling for 13 choices 2025-12-04T09:14:59.1657664Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.1657748Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.1657811Z unimplemented [] 2025-12-04T09:14:59.1657917Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.1658158Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.1659384Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.1659456Z graph_break [] 2025-12-04T09:14:59.1659589Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.1659659Z Autotune Choices Stats: 2025-12-04T09:14:59.1661074Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_669", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:59.1661323Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1661551Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1662045Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1663204Z triton_flex_attention_669 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1664323Z triton_flex_attention_670 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1665573Z triton_flex_attention_667 0.0123 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.1666689Z triton_flex_attention_665 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1667864Z triton_flex_attention_668 0.0135 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.1669048Z triton_flex_attention_666 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1669312Z SingleProcess AUTOTUNE benchmarking takes 0.2949 seconds and 1.4095 seconds precompiling for 6 choices 2025-12-04T09:14:59.1669382Z Autotune Choices Stats: 2025-12-04T09:14:59.1670822Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_673", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012512000277638435, "best_triton_pos": 0} 2025-12-04T09:14:59.1671265Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1671612Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1672180Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1673352Z triton_flex_attention_backward_673 0.0125 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1674571Z triton_flex_attention_backward_672 0.0133 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1675755Z triton_flex_attention_backward_674 0.0133 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1676942Z triton_flex_attention_backward_671 0.0143 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1678087Z triton_flex_attention_backward_676 0.0143 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1679257Z triton_flex_attention_backward_675 0.0152 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1680407Z triton_flex_attention_backward_677 0.0154 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1681557Z triton_flex_attention_backward_678 0.0154 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1682780Z triton_flex_attention_backward_680 0.0164 ms 76.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1683935Z triton_flex_attention_backward_683 0.0164 ms 76.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.1684239Z SingleProcess AUTOTUNE benchmarking takes 0.6665 seconds and 2.3098 seconds precompiling for 13 choices 2025-12-04T09:14:59.1684418Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:14:59.1684497Z Traceback (most recent call last): 2025-12-04T09:14:59.1684806Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:14:59.1684874Z self.assertTrue( 2025-12-04T09:14:59.1685083Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:14:59.1685207Z raise self.failureException(msg) 2025-12-04T09:14:59.1685451Z AssertionError: False is not true : Log file /tmp/tmpvh1jtdth/flex_attention_configs.json was not created 2025-12-04T09:14:59.1685456Z 2025-12-04T09:14:59.1685613Z To execute this test, run the following from the base repo dir: 2025-12-04T09:14:59.1685869Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:14:59.1685873Z 2025-12-04T09:14:59.1686044Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:14:59.1686180Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.1686252Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.1686319Z unimplemented [] 2025-12-04T09:14:59.1686425Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.1687623Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('async_compile_cache_miss', 23), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:14:59.1687812Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.1687872Z graph_break [] 2025-12-04T09:14:59.1688011Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.1689011Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:14:59.1689098Z current_size = base.storage().size() 2025-12-04T09:14:59.1689166Z Autotune Choices Stats: 2025-12-04T09:14:59.1690559Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:59.1690879Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1691099Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1691426Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1692691Z triton_flex_attention_4 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1693853Z triton_flex_attention_5 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1694978Z triton_flex_attention_0 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1696089Z triton_flex_attention_2 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.1697208Z triton_flex_attention_3 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.1698318Z triton_flex_attention_1 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1698637Z SingleProcess AUTOTUNE benchmarking takes 0.2600 seconds and 1.4244 seconds precompiling for 6 choices 2025-12-04T09:14:59.1698707Z Autotune Choices Stats: 2025-12-04T09:14:59.1700154Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:59.1700595Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1700965Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1701519Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1702722Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1703877Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1705032Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1706181Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1707327Z triton_flex_attention_backward_11 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1708550Z triton_flex_attention_backward_12 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1709724Z triton_flex_attention_backward_13 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1710904Z triton_flex_attention_backward_10 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1712053Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1713199Z triton_flex_attention_backward_15 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1713450Z SingleProcess AUTOTUNE benchmarking takes 0.8717 seconds and 2.2315 seconds precompiling for 13 choices 2025-12-04T09:14:59.1713590Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.1713662Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.1713731Z unimplemented [] 2025-12-04T09:14:59.1713838Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.1714023Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.1715210Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.1715370Z graph_break [] 2025-12-04T09:14:59.1715508Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.1715578Z Autotune Choices Stats: 2025-12-04T09:14:59.1716964Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.1717214Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1717436Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1717802Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1718968Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1720082Z triton_flex_attention_24 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1721199Z triton_flex_attention_19 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1722330Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.1723450Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.1724561Z triton_flex_attention_20 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1724874Z SingleProcess AUTOTUNE benchmarking takes 0.2898 seconds and 1.2778 seconds precompiling for 6 choices 2025-12-04T09:14:59.1724942Z Autotune Choices Stats: 2025-12-04T09:14:59.1726404Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.1726838Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1727218Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1727777Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1728944Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1730102Z triton_flex_attention_backward_27 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1731260Z triton_flex_attention_backward_28 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1732467Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1733894Z triton_flex_attention_backward_30 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1735080Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1736264Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1737412Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1738557Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1739709Z triton_flex_attention_backward_37 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.1739958Z SingleProcess AUTOTUNE benchmarking takes 0.6605 seconds and 2.3199 seconds precompiling for 13 choices 2025-12-04T09:14:59.1740094Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.1740167Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.1740241Z unimplemented [] 2025-12-04T09:14:59.1740345Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.1740531Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.1741714Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.1741844Z graph_break [] 2025-12-04T09:14:59.1741989Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.1742061Z Autotune Choices Stats: 2025-12-04T09:14:59.1743484Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.1743737Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1743953Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1744275Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1745433Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1746556Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1747671Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.1748786Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.1749904Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1751100Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1751346Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2616 seconds precompiling for 6 choices 2025-12-04T09:14:59.1751412Z Autotune Choices Stats: 2025-12-04T09:14:59.1752890Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.1753357Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1753690Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1754253Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1755422Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1756573Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1757736Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1758944Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1760096Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1761272Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1762603Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1763786Z triton_flex_attention_backward_50 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1764940Z triton_flex_attention_backward_53 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1766096Z triton_flex_attention_backward_56 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.1766343Z SingleProcess AUTOTUNE benchmarking takes 0.6631 seconds and 2.2635 seconds precompiling for 13 choices 2025-12-04T09:14:59.1766481Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.1766646Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.1766713Z unimplemented [] 2025-12-04T09:14:59.1766824Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.1767011Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.1768196Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.1768257Z graph_break [] 2025-12-04T09:14:59.1768391Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.1768459Z Autotune Choices Stats: 2025-12-04T09:14:59.1769902Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.1770185Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1770406Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1770727Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1771861Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1773045Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1774162Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.1775279Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.1776503Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1777618Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1777903Z SingleProcess AUTOTUNE benchmarking takes 0.2907 seconds and 1.2500 seconds precompiling for 6 choices 2025-12-04T09:14:59.1777971Z Autotune Choices Stats: 2025-12-04T09:14:59.1779459Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.1779895Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1780229Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1780787Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1781949Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1783113Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1784276Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1785492Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1786679Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1787853Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1789011Z triton_flex_attention_backward_69 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1790158Z triton_flex_attention_backward_70 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1791305Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1792455Z triton_flex_attention_backward_75 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.1792772Z SingleProcess AUTOTUNE benchmarking takes 0.6601 seconds and 2.1469 seconds precompiling for 13 choices 2025-12-04T09:14:59.1792914Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.1792987Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.1793051Z unimplemented [] 2025-12-04T09:14:59.1793160Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.1793344Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.1794527Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.1794589Z graph_break [] 2025-12-04T09:14:59.1794750Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.1794824Z Autotune Choices Stats: 2025-12-04T09:14:59.1796244Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.1796503Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1796721Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1797039Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1798163Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1799288Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1800406Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.1801594Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.1802710Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1803856Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1804108Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2576 seconds precompiling for 6 choices 2025-12-04T09:14:59.1804173Z Autotune Choices Stats: 2025-12-04T09:14:59.1805629Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_84", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.1806072Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1806399Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1806962Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1808125Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1809285Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1810499Z triton_flex_attention_backward_83 0.0122 ms 92.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1811689Z triton_flex_attention_backward_82 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1812934Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1814081Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1815228Z triton_flex_attention_backward_88 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1816381Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1817523Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1818673Z triton_flex_attention_backward_91 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1818985Z SingleProcess AUTOTUNE benchmarking takes 0.6599 seconds and 2.2799 seconds precompiling for 13 choices 2025-12-04T09:14:59.1819119Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.1819188Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.1819252Z unimplemented [] 2025-12-04T09:14:59.1819360Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.1819546Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.1820761Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.1820825Z graph_break [] 2025-12-04T09:14:59.1820951Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.1821027Z Autotune Choices Stats: 2025-12-04T09:14:59.1822441Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:59.1822699Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1822913Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1823235Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1824363Z triton_flex_attention_99 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1825484Z triton_flex_attention_100 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1826593Z triton_flex_attention_95 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1827783Z triton_flex_attention_98 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.1828930Z triton_flex_attention_97 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.1830079Z triton_flex_attention_96 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1830324Z SingleProcess AUTOTUNE benchmarking takes 0.2905 seconds and 1.2994 seconds precompiling for 6 choices 2025-12-04T09:14:59.1830391Z Autotune Choices Stats: 2025-12-04T09:14:59.1831816Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_102", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.1832265Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1832590Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1833157Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1834324Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1835578Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1836743Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1837923Z triton_flex_attention_backward_101 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1839114Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1840263Z triton_flex_attention_backward_107 0.0133 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1841417Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1842572Z triton_flex_attention_backward_108 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1843710Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1844946Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1845190Z SingleProcess AUTOTUNE benchmarking takes 0.6615 seconds and 2.2434 seconds precompiling for 13 choices 2025-12-04T09:14:59.1845336Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.1845412Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.1845474Z unimplemented [] 2025-12-04T09:14:59.1845615Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.1845797Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.1847007Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.1847068Z graph_break [] 2025-12-04T09:14:59.1847192Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.1847266Z Autotune Choices Stats: 2025-12-04T09:14:59.1848657Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.1848905Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1849129Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1849455Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1850581Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1851706Z triton_flex_attention_119 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1852942Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.1854116Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.1855257Z triton_flex_attention_114 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1856390Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1856637Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2587 seconds precompiling for 6 choices 2025-12-04T09:14:59.1856714Z Autotune Choices Stats: 2025-12-04T09:14:59.1858131Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:59.1858570Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1858892Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1859456Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1860620Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1861996Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1863226Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1864452Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1865619Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1866765Z triton_flex_attention_backward_124 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1867919Z triton_flex_attention_backward_126 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1869072Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1870319Z triton_flex_attention_backward_129 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1871474Z triton_flex_attention_backward_132 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.1871758Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.3330 seconds precompiling for 13 choices 2025-12-04T09:14:59.1871892Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.1871962Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.1872034Z unimplemented [] 2025-12-04T09:14:59.1872142Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.1872359Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.1873542Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.1873605Z graph_break [] 2025-12-04T09:14:59.1873730Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.1873800Z Autotune Choices Stats: 2025-12-04T09:14:59.1875198Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:59.1875457Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1875675Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1875998Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1877128Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1878336Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1879463Z triton_flex_attention_136 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.1880623Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.1881760Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1882885Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1883132Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3488 seconds precompiling for 6 choices 2025-12-04T09:14:59.1883198Z Autotune Choices Stats: 2025-12-04T09:14:59.1884630Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_140", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.1885070Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1885398Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1886030Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1887196Z triton_flex_attention_backward_140 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1888383Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1889569Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1890718Z triton_flex_attention_backward_142 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1891874Z triton_flex_attention_backward_144 0.0131 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1893063Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1894217Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1895435Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1896587Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1897776Z triton_flex_attention_backward_149 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1898020Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.2392 seconds precompiling for 13 choices 2025-12-04T09:14:59.1898187Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.1898258Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.1898322Z unimplemented [] 2025-12-04T09:14:59.1898430Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.1898616Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.1899796Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.1899861Z graph_break [] 2025-12-04T09:14:59.1899989Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.1900062Z Autotune Choices Stats: 2025-12-04T09:14:59.1901460Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.1901712Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1901933Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1902255Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1903390Z triton_flex_attention_156 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1904576Z triton_flex_attention_157 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1905722Z triton_flex_attention_152 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1906896Z triton_flex_attention_154 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.1908014Z triton_flex_attention_155 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.1909147Z triton_flex_attention_153 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1909393Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3309 seconds precompiling for 6 choices 2025-12-04T09:14:59.1909469Z Autotune Choices Stats: 2025-12-04T09:14:59.1910888Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_160", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.1911324Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1911758Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1912321Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1913488Z triton_flex_attention_backward_160 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1914680Z triton_flex_attention_backward_158 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1915866Z triton_flex_attention_backward_159 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1917011Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1918167Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1919334Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1920487Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1921698Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1922880Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1924072Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.1924320Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.1546 seconds precompiling for 13 choices 2025-12-04T09:14:59.1924456Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.1924527Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.1924588Z unimplemented [] 2025-12-04T09:14:59.1924704Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.1924892Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.1926068Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.1926130Z graph_break [] 2025-12-04T09:14:59.1926255Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.1926327Z Autotune Choices Stats: 2025-12-04T09:14:59.1927721Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.1927969Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1928258Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1928579Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1929708Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1930866Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1932013Z triton_flex_attention_171 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1933191Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.1934309Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.1935428Z triton_flex_attention_172 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1935670Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3199 seconds precompiling for 6 choices 2025-12-04T09:14:59.1935743Z Autotune Choices Stats: 2025-12-04T09:14:59.1937168Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_178", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.1937674Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1938002Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1938564Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1939809Z triton_flex_attention_backward_178 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1941005Z triton_flex_attention_backward_177 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1942164Z triton_flex_attention_backward_180 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1943335Z triton_flex_attention_backward_179 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1944494Z triton_flex_attention_backward_182 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1945640Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1946862Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1948043Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1949230Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1950383Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.1950629Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3449 seconds precompiling for 13 choices 2025-12-04T09:14:59.1950764Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.1950834Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.1950894Z unimplemented [] 2025-12-04T09:14:59.1951002Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.1951184Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.1952357Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.1952428Z graph_break [] 2025-12-04T09:14:59.1952556Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.1952640Z Autotune Choices Stats: 2025-12-04T09:14:59.1954037Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:59.1954352Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1954569Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1954889Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1956117Z triton_flex_attention_194 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1957238Z triton_flex_attention_195 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1958386Z triton_flex_attention_192 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.1959503Z triton_flex_attention_193 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.1960625Z triton_flex_attention_190 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1961749Z triton_flex_attention_191 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1962197Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3094 seconds precompiling for 6 choices 2025-12-04T09:14:59.1962276Z Autotune Choices Stats: 2025-12-04T09:14:59.1963835Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.1964280Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1964607Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1965230Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1966442Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1967608Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1968764Z triton_flex_attention_backward_197 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1969917Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1971073Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1972353Z triton_flex_attention_backward_200 0.0124 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1973510Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1974698Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1975885Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.1977042Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1977287Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.3930 seconds precompiling for 13 choices 2025-12-04T09:14:59.1977416Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.1977493Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.1977560Z unimplemented [] 2025-12-04T09:14:59.1977670Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.1977853Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.1979029Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.1979095Z graph_break [] 2025-12-04T09:14:59.1979230Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.1979298Z Autotune Choices Stats: 2025-12-04T09:14:59.1980759Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008063999935984612, "best_triton_pos": 0} 2025-12-04T09:14:59.1981006Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1981221Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1981532Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1982699Z triton_flex_attention_213 0.0081 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1983848Z triton_flex_attention_214 0.0082 ms 98.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1984969Z triton_flex_attention_211 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.1986090Z triton_flex_attention_212 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.1987200Z triton_flex_attention_209 0.0112 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1988314Z triton_flex_attention_210 0.0133 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1988623Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.3148 seconds precompiling for 6 choices 2025-12-04T09:14:59.1988707Z Autotune Choices Stats: 2025-12-04T09:14:59.1990138Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.1990613Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.1990937Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.1991528Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.1992701Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1993861Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1995013Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.1996173Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.1997332Z triton_flex_attention_backward_219 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.1998557Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.1999749Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2000923Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2002080Z triton_flex_attention_backward_224 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2003231Z triton_flex_attention_backward_227 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.2003488Z SingleProcess AUTOTUNE benchmarking takes 0.6619 seconds and 2.3746 seconds precompiling for 13 choices 2025-12-04T09:14:59.2003616Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.2003690Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.2003751Z unimplemented [] 2025-12-04T09:14:59.2003857Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.2004039Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.2005217Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.2005370Z graph_break [] 2025-12-04T09:14:59.2005498Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.2005570Z Autotune Choices Stats: 2025-12-04T09:14:59.2006955Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.2007205Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2007454Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2007766Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2008937Z triton_flex_attention_232 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2010069Z triton_flex_attention_233 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2011179Z triton_flex_attention_228 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2012342Z triton_flex_attention_230 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.2013454Z triton_flex_attention_231 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.2014644Z triton_flex_attention_229 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2014896Z SingleProcess AUTOTUNE benchmarking takes 0.2900 seconds and 1.2906 seconds precompiling for 6 choices 2025-12-04T09:14:59.2014969Z Autotune Choices Stats: 2025-12-04T09:14:59.2016425Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:59.2016865Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2017221Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2017783Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2018952Z triton_flex_attention_backward_234 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2020112Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2021270Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2022430Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2023646Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2024822Z triton_flex_attention_backward_240 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2026003Z triton_flex_attention_backward_241 0.0134 ms 91.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2027149Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2028301Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2029454Z triton_flex_attention_backward_246 0.0153 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.2029699Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2768 seconds precompiling for 13 choices 2025-12-04T09:14:59.2029830Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.2029906Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.2029969Z unimplemented [] 2025-12-04T09:14:59.2030078Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.2030262Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.2031509Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.2031575Z graph_break [] 2025-12-04T09:14:59.2031703Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.2031777Z Autotune Choices Stats: 2025-12-04T09:14:59.2033202Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.2033452Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2033670Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2034037Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2035181Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2036306Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2037420Z triton_flex_attention_250 0.0111 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.2038535Z triton_flex_attention_247 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2039646Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.2040830Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2041070Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2952 seconds precompiling for 6 choices 2025-12-04T09:14:59.2041155Z Autotune Choices Stats: 2025-12-04T09:14:59.2042682Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_255", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.2043119Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2043445Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2044010Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2045173Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2046330Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2047478Z triton_flex_attention_backward_254 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2048695Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2049856Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2051030Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2052210Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2053412Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2054568Z triton_flex_attention_backward_262 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2055726Z triton_flex_attention_backward_265 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.2055969Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.3139 seconds precompiling for 13 choices 2025-12-04T09:14:59.2056164Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.2056239Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.2056300Z unimplemented [] 2025-12-04T09:14:59.2056401Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.2056589Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.2057762Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.2057827Z graph_break [] 2025-12-04T09:14:59.2057955Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.2058025Z Autotune Choices Stats: 2025-12-04T09:14:59.2059480Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.2059725Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2059940Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2060256Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2061385Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2062666Z triton_flex_attention_271 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2063797Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2064917Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.2066163Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.2067344Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2067594Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T09:14:59.2067665Z Autotune Choices Stats: 2025-12-04T09:14:59.2069133Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:59.2069578Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2069904Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2070474Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2071632Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2072802Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2074025Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2075184Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2076374Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2077556Z triton_flex_attention_backward_278 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2078707Z triton_flex_attention_backward_279 0.0136 ms 90.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2079853Z triton_flex_attention_backward_276 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2081003Z triton_flex_attention_backward_281 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2082154Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.2082468Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.2940 seconds precompiling for 13 choices 2025-12-04T09:14:59.2082597Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.2082673Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.2082735Z unimplemented [] 2025-12-04T09:14:59.2082837Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.2083029Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.2084242Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.2084316Z graph_break [] 2025-12-04T09:14:59.2084442Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.2084514Z Autotune Choices Stats: 2025-12-04T09:14:59.2085933Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.2086186Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2086404Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2086722Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2087851Z triton_flex_attention_289 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2088969Z triton_flex_attention_290 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2090096Z triton_flex_attention_287 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.2091273Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2092444Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.2093603Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2093878Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3003 seconds precompiling for 6 choices 2025-12-04T09:14:59.2093952Z Autotune Choices Stats: 2025-12-04T09:14:59.2095377Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_294", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.012160000391304493, "best_triton_pos": 0} 2025-12-04T09:14:59.2095817Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2096143Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2096707Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2097883Z triton_flex_attention_backward_294 0.0122 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2099044Z triton_flex_attention_backward_291 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2100276Z triton_flex_attention_backward_292 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2101462Z triton_flex_attention_backward_293 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2102638Z triton_flex_attention_backward_296 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2103793Z triton_flex_attention_backward_295 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2104946Z triton_flex_attention_backward_297 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2106090Z triton_flex_attention_backward_298 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2107243Z triton_flex_attention_backward_300 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2108450Z triton_flex_attention_backward_303 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.2108699Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.2523 seconds precompiling for 13 choices 2025-12-04T09:14:59.2108825Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.2108906Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.2108966Z unimplemented [] 2025-12-04T09:14:59.2109070Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.2109256Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.2110462Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.2110529Z graph_break [] 2025-12-04T09:14:59.2110686Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.2110753Z Autotune Choices Stats: 2025-12-04T09:14:59.2112162Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_308", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.2112414Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2112631Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2112944Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2114076Z triton_flex_attention_308 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2115187Z triton_flex_attention_309 0.0082 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2116376Z triton_flex_attention_307 0.0112 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.2117483Z triton_flex_attention_304 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2118626Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.2119776Z triton_flex_attention_305 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2120019Z SingleProcess AUTOTUNE benchmarking takes 0.2924 seconds and 1.4096 seconds precompiling for 6 choices 2025-12-04T09:14:59.2120092Z Autotune Choices Stats: 2025-12-04T09:14:59.2121518Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.2121957Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2122285Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2122849Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2124026Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2125257Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2126449Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2127643Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2128794Z triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2129955Z triton_flex_attention_backward_316 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2131127Z triton_flex_attention_backward_317 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2132283Z triton_flex_attention_backward_314 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2133478Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2134727Z triton_flex_attention_backward_318 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2134995Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.3225 seconds precompiling for 13 choices 2025-12-04T09:14:59.2135160Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.2135237Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.2135298Z unimplemented [] 2025-12-04T09:14:59.2135401Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.2135593Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.2136809Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.2136878Z graph_break [] 2025-12-04T09:14:59.2137005Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.2137075Z Autotune Choices Stats: 2025-12-04T09:14:59.2138473Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.007968000136315823, "best_triton_pos": 0} 2025-12-04T09:14:59.2138717Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2138955Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2139270Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2140398Z triton_flex_attention_327 0.0080 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2141512Z triton_flex_attention_328 0.0082 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2142695Z triton_flex_attention_325 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.2143842Z triton_flex_attention_326 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.2144983Z triton_flex_attention_323 0.0113 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2146105Z triton_flex_attention_324 0.0133 ms 59.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2146349Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T09:14:59.2146421Z Autotune Choices Stats: 2025-12-04T09:14:59.2147847Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.2148286Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2148611Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2149175Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2155345Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2156596Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2157846Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2159048Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2160202Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2161373Z triton_flex_attention_backward_336 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2162735Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2163892Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2165212Z triton_flex_attention_backward_338 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2166421Z triton_flex_attention_backward_341 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.2166700Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3668 seconds precompiling for 13 choices 2025-12-04T09:14:59.2166843Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.2166925Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.2166992Z unimplemented [] 2025-12-04T09:14:59.2167179Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.2167381Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.2168570Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.2168641Z graph_break [] 2025-12-04T09:14:59.2168778Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.2168847Z Autotune Choices Stats: 2025-12-04T09:14:59.2170254Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.2170502Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2170732Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2171065Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2172206Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2173468Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2174622Z triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2175776Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.2176896Z triton_flex_attention_345 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.2178024Z triton_flex_attention_343 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2178271Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2820 seconds precompiling for 6 choices 2025-12-04T09:14:59.2178350Z Autotune Choices Stats: 2025-12-04T09:14:59.2179784Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_349", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.2180226Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2180554Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2181416Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2182598Z triton_flex_attention_backward_349 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2183844Z triton_flex_attention_backward_348 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2185039Z triton_flex_attention_backward_350 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2186193Z triton_flex_attention_backward_351 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2187341Z triton_flex_attention_backward_353 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2188501Z triton_flex_attention_backward_355 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2189654Z triton_flex_attention_backward_352 0.0143 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2190863Z triton_flex_attention_backward_354 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2192049Z triton_flex_attention_backward_357 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2193228Z triton_flex_attention_backward_360 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.2193481Z SingleProcess AUTOTUNE benchmarking takes 0.6637 seconds and 2.2959 seconds precompiling for 13 choices 2025-12-04T09:14:59.2193617Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.2193698Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.2193761Z unimplemented [] 2025-12-04T09:14:59.2193872Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.2194067Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.2195260Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.2195329Z graph_break [] 2025-12-04T09:14:59.2195459Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.2195533Z Autotune Choices Stats: 2025-12-04T09:14:59.2196929Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.2197175Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2197399Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2197784Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2198919Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2200036Z triton_flex_attention_366 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2201213Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.2202365Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2203489Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.2204614Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2204865Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3270 seconds precompiling for 6 choices 2025-12-04T09:14:59.2204933Z Autotune Choices Stats: 2025-12-04T09:14:59.2206363Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_368", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.2206869Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2207193Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2207752Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2208958Z triton_flex_attention_backward_368 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2210143Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2211294Z triton_flex_attention_backward_369 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2212530Z triton_flex_attention_backward_370 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2213682Z triton_flex_attention_backward_372 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2214841Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2216071Z triton_flex_attention_backward_374 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2217224Z triton_flex_attention_backward_371 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2218418Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2219604Z triton_flex_attention_backward_379 0.0146 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.2219876Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.4637 seconds precompiling for 13 choices 2025-12-04T09:14:59.2220009Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.2220087Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.2220152Z unimplemented [] 2025-12-04T09:14:59.2220259Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.2220454Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.2221634Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.2221706Z graph_break [] 2025-12-04T09:14:59.2221835Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.2221904Z Autotune Choices Stats: 2025-12-04T09:14:59.2223307Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_384", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.2223617Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2223849Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2224166Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2225305Z triton_flex_attention_384 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2226458Z triton_flex_attention_385 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2227615Z triton_flex_attention_380 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2228734Z triton_flex_attention_382 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.2229855Z triton_flex_attention_383 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.2230984Z triton_flex_attention_381 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2231230Z SingleProcess AUTOTUNE benchmarking takes 0.2918 seconds and 1.3242 seconds precompiling for 6 choices 2025-12-04T09:14:59.2231298Z Autotune Choices Stats: 2025-12-04T09:14:59.2232730Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_386", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:59.2233242Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2233579Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2234138Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2235339Z triton_flex_attention_backward_386 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2236531Z triton_flex_attention_backward_387 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2237684Z triton_flex_attention_backward_388 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2238839Z triton_flex_attention_backward_389 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2239989Z triton_flex_attention_backward_391 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2241140Z triton_flex_attention_backward_392 0.0134 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2242355Z triton_flex_attention_backward_390 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2243527Z triton_flex_attention_backward_393 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2244716Z triton_flex_attention_backward_395 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2245864Z triton_flex_attention_backward_394 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2246113Z SingleProcess AUTOTUNE benchmarking takes 0.6645 seconds and 2.2524 seconds precompiling for 13 choices 2025-12-04T09:14:59.2246243Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.2246317Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.2246379Z unimplemented [] 2025-12-04T09:14:59.2246483Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.2246674Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.2247852Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.2247922Z graph_break [] 2025-12-04T09:14:59.2248048Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.2248116Z Autotune Choices Stats: 2025-12-04T09:14:59.2249501Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_403", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.2249810Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2250036Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2250364Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2251526Z triton_flex_attention_403 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2252741Z triton_flex_attention_404 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2253867Z triton_flex_attention_399 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2254987Z triton_flex_attention_401 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.2256109Z triton_flex_attention_402 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.2257234Z triton_flex_attention_400 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2257537Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3374 seconds precompiling for 6 choices 2025-12-04T09:14:59.2257604Z Autotune Choices Stats: 2025-12-04T09:14:59.2259029Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_406", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.2259459Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2259825Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2260388Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2261587Z triton_flex_attention_backward_406 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2262958Z triton_flex_attention_backward_405 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2264117Z triton_flex_attention_backward_407 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2265296Z triton_flex_attention_backward_408 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2266450Z triton_flex_attention_backward_410 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2267766Z triton_flex_attention_backward_409 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2268957Z triton_flex_attention_backward_411 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2270171Z triton_flex_attention_backward_412 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2271325Z triton_flex_attention_backward_413 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2272480Z triton_flex_attention_backward_414 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2272737Z SingleProcess AUTOTUNE benchmarking takes 0.6644 seconds and 2.4031 seconds precompiling for 13 choices 2025-12-04T09:14:59.2272879Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.2272952Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.2273023Z unimplemented [] 2025-12-04T09:14:59.2273128Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.2273320Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.2274498Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.2274636Z graph_break [] 2025-12-04T09:14:59.2274766Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.2274835Z Autotune Choices Stats: 2025-12-04T09:14:59.2276231Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_422", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.2276475Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2276699Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2277051Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2278217Z triton_flex_attention_422 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2279332Z triton_flex_attention_423 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2280463Z triton_flex_attention_420 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.2281584Z triton_flex_attention_421 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.2282708Z triton_flex_attention_418 0.0123 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2283837Z triton_flex_attention_419 0.0142 ms 57.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2284151Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3073 seconds precompiling for 6 choices 2025-12-04T09:14:59.2284222Z Autotune Choices Stats: 2025-12-04T09:14:59.2285657Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_426", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012256000190973282, "best_triton_pos": 0} 2025-12-04T09:14:59.2286125Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2286460Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2287051Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2288221Z triton_flex_attention_backward_426 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2289382Z triton_flex_attention_backward_424 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2290543Z triton_flex_attention_backward_425 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2291696Z triton_flex_attention_backward_427 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2292949Z triton_flex_attention_backward_429 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2294100Z triton_flex_attention_backward_431 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2295278Z triton_flex_attention_backward_430 0.0141 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2296479Z triton_flex_attention_backward_428 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2297640Z triton_flex_attention_backward_433 0.0143 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2298788Z triton_flex_attention_backward_436 0.0144 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.2299041Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.3392 seconds precompiling for 13 choices 2025-12-04T09:14:59.2299181Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.2299253Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.2299322Z unimplemented [] 2025-12-04T09:14:59.2299426Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.2299619Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.2300801Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.2300937Z graph_break [] 2025-12-04T09:14:59.2301089Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.2301164Z Autotune Choices Stats: 2025-12-04T09:14:59.2302567Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_441", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.2302857Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2303078Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2303397Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2304566Z triton_flex_attention_441 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2305690Z triton_flex_attention_442 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2306822Z triton_flex_attention_439 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.2307943Z triton_flex_attention_437 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2309075Z triton_flex_attention_440 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.2310262Z triton_flex_attention_438 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2310508Z SingleProcess AUTOTUNE benchmarking takes 0.2906 seconds and 1.4377 seconds precompiling for 6 choices 2025-12-04T09:14:59.2310586Z Autotune Choices Stats: 2025-12-04T09:14:59.2312055Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_445", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.2312535Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2312863Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2313430Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2314595Z triton_flex_attention_backward_445 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2315755Z triton_flex_attention_backward_446 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2316908Z triton_flex_attention_backward_443 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2318066Z triton_flex_attention_backward_444 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2319328Z triton_flex_attention_backward_447 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2320525Z triton_flex_attention_backward_448 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2321712Z triton_flex_attention_backward_450 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2322860Z triton_flex_attention_backward_449 0.0135 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2324015Z triton_flex_attention_backward_452 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2325173Z triton_flex_attention_backward_455 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.2325421Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.3278 seconds precompiling for 13 choices 2025-12-04T09:14:59.2325554Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.2325633Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.2325698Z unimplemented [] 2025-12-04T09:14:59.2325874Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.2326059Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.2327242Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.2327314Z graph_break [] 2025-12-04T09:14:59.2327444Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.2327519Z Autotune Choices Stats: 2025-12-04T09:14:59.2328965Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_460", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:59.2329219Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2329468Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2329782Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2330922Z triton_flex_attention_460 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2332053Z triton_flex_attention_461 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2333273Z triton_flex_attention_459 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.2334390Z triton_flex_attention_456 0.0114 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2335604Z triton_flex_attention_458 0.0123 ms 75.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.2336740Z triton_flex_attention_457 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2337026Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.4513 seconds precompiling for 6 choices 2025-12-04T09:14:59.2337100Z Autotune Choices Stats: 2025-12-04T09:14:59.2338551Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_463", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.2339000Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2339332Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2339896Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2341062Z triton_flex_attention_backward_463 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2342226Z triton_flex_attention_backward_465 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2343373Z triton_flex_attention_backward_462 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2344586Z triton_flex_attention_backward_464 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2345766Z triton_flex_attention_backward_467 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2346937Z triton_flex_attention_backward_468 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2348093Z triton_flex_attention_backward_469 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2349248Z triton_flex_attention_backward_466 0.0133 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2350404Z triton_flex_attention_backward_471 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2351552Z triton_flex_attention_backward_470 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2351871Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.2662 seconds precompiling for 13 choices 2025-12-04T09:14:59.2352004Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.2352079Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.2352142Z unimplemented [] 2025-12-04T09:14:59.2352250Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.2352436Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.2353617Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.2353683Z graph_break [] 2025-12-04T09:14:59.2353851Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.2353925Z Autotune Choices Stats: 2025-12-04T09:14:59.2355348Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_479", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.007968000136315823, "best_triton_pos": 0} 2025-12-04T09:14:59.2355602Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2355827Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2356145Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2357276Z triton_flex_attention_479 0.0080 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2358382Z triton_flex_attention_480 0.0082 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2359511Z triton_flex_attention_477 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.2360629Z triton_flex_attention_478 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.2361987Z triton_flex_attention_475 0.0113 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2363191Z triton_flex_attention_476 0.0133 ms 59.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2363438Z SingleProcess AUTOTUNE benchmarking takes 0.2917 seconds and 1.3266 seconds precompiling for 6 choices 2025-12-04T09:14:59.2363510Z Autotune Choices Stats: 2025-12-04T09:14:59.2364988Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_481", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.2365429Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2365755Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2366318Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2367478Z triton_flex_attention_backward_481 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2368638Z triton_flex_attention_backward_482 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2369880Z triton_flex_attention_backward_483 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2371069Z triton_flex_attention_backward_484 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2372260Z triton_flex_attention_backward_485 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2373489Z triton_flex_attention_backward_486 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2374647Z triton_flex_attention_backward_487 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2375796Z triton_flex_attention_backward_488 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2376962Z triton_flex_attention_backward_490 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2378107Z triton_flex_attention_backward_493 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.2378422Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2475 seconds precompiling for 13 choices 2025-12-04T09:14:59.2378552Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.2378628Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.2378692Z unimplemented [] 2025-12-04T09:14:59.2378795Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.2378984Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.2380190Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.2380259Z graph_break [] 2025-12-04T09:14:59.2380387Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.2380460Z Autotune Choices Stats: 2025-12-04T09:14:59.2381877Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_498", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.2382140Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2382358Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2382678Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2383811Z triton_flex_attention_498 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2384920Z triton_flex_attention_499 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2386034Z triton_flex_attention_494 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2387255Z triton_flex_attention_496 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.2388408Z triton_flex_attention_497 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.2389557Z triton_flex_attention_495 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2389799Z SingleProcess AUTOTUNE benchmarking takes 0.2916 seconds and 1.3152 seconds precompiling for 6 choices 2025-12-04T09:14:59.2389876Z Autotune Choices Stats: 2025-12-04T09:14:59.2391295Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_500", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:59.2391732Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2392058Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2392629Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2393788Z triton_flex_attention_backward_500 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2395031Z triton_flex_attention_backward_501 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2396185Z triton_flex_attention_backward_502 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2397381Z triton_flex_attention_backward_503 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2398574Z triton_flex_attention_backward_505 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2399727Z triton_flex_attention_backward_507 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2400886Z triton_flex_attention_backward_506 0.0143 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2402037Z triton_flex_attention_backward_504 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2403194Z triton_flex_attention_backward_509 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2404417Z triton_flex_attention_backward_512 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.2404662Z SingleProcess AUTOTUNE benchmarking takes 0.6641 seconds and 2.2733 seconds precompiling for 13 choices 2025-12-04T09:14:59.2404801Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.2404873Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.2404936Z unimplemented [] 2025-12-04T09:14:59.2405078Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.2405264Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.2406475Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 27), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 8), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 5), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.2406546Z graph_break [] 2025-12-04T09:14:59.2406675Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.2406753Z Autotune Choices Stats: 2025-12-04T09:14:59.2408147Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_517", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.2408397Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2408616Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2408932Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2410064Z triton_flex_attention_517 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2411177Z triton_flex_attention_518 0.0084 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2412418Z triton_flex_attention_513 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2413575Z triton_flex_attention_515 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.2414686Z triton_flex_attention_516 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.2415830Z triton_flex_attention_514 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2416074Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3311 seconds precompiling for 6 choices 2025-12-04T09:14:59.2416149Z Autotune Choices Stats: 2025-12-04T09:14:59.2417563Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_519", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:59.2418003Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2418331Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2418894Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2420057Z triton_flex_attention_backward_519 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2421287Z triton_flex_attention_backward_520 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2422474Z triton_flex_attention_backward_521 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2423684Z triton_flex_attention_backward_522 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2424847Z triton_flex_attention_backward_524 0.0132 ms 93.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2425997Z triton_flex_attention_backward_523 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2427154Z triton_flex_attention_backward_525 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2428302Z triton_flex_attention_backward_526 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2429515Z triton_flex_attention_backward_528 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2430669Z triton_flex_attention_backward_531 0.0153 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.2430948Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.2909 seconds precompiling for 13 choices 2025-12-04T09:14:59.2431078Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.2431153Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.2431226Z unimplemented [] 2025-12-04T09:14:59.2431333Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.2431512Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.2432725Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.2432797Z graph_break [] 2025-12-04T09:14:59.2432926Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.2432999Z Autotune Choices Stats: 2025-12-04T09:14:59.2434382Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_536", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.2434630Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2434848Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2435160Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2436288Z triton_flex_attention_536 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2437469Z triton_flex_attention_537 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2438576Z triton_flex_attention_532 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2439728Z triton_flex_attention_534 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.2440871Z triton_flex_attention_535 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.2441987Z triton_flex_attention_533 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2442229Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.2928 seconds precompiling for 6 choices 2025-12-04T09:14:59.2442303Z Autotune Choices Stats: 2025-12-04T09:14:59.2443714Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_538", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:14:59.2444152Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2444478Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2445125Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2446291Z triton_flex_attention_backward_538 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2447475Z triton_flex_attention_backward_539 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2448661Z triton_flex_attention_backward_540 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2449828Z triton_flex_attention_backward_541 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2450986Z triton_flex_attention_backward_542 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2452132Z triton_flex_attention_backward_543 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2453342Z triton_flex_attention_backward_545 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2454603Z triton_flex_attention_backward_544 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2455755Z triton_flex_attention_backward_547 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2456954Z triton_flex_attention_backward_550 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.2457198Z SingleProcess AUTOTUNE benchmarking takes 0.6681 seconds and 2.3456 seconds precompiling for 13 choices 2025-12-04T09:14:59.2457359Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.2457439Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.2457502Z unimplemented [] 2025-12-04T09:14:59.2457612Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.2457808Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.2458983Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.2459061Z graph_break [] 2025-12-04T09:14:59.2459190Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.2459265Z Autotune Choices Stats: 2025-12-04T09:14:59.2460647Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_555", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:59.2460894Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2461112Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2461427Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2462730Z triton_flex_attention_555 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2463991Z triton_flex_attention_556 0.0112 ms 91.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2465159Z triton_flex_attention_553 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.2466329Z triton_flex_attention_554 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.2467456Z triton_flex_attention_551 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2468578Z triton_flex_attention_552 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2468824Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.3230 seconds precompiling for 6 choices 2025-12-04T09:14:59.2468900Z Autotune Choices Stats: 2025-12-04T09:14:59.2470329Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_559", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:59.2470769Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2471165Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2471729Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2472900Z triton_flex_attention_backward_559 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2474094Z triton_flex_attention_backward_560 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2475277Z triton_flex_attention_backward_558 0.0142 ms 93.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2476431Z triton_flex_attention_backward_557 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2477582Z triton_flex_attention_backward_562 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2478734Z triton_flex_attention_backward_561 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2479901Z triton_flex_attention_backward_563 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2481110Z triton_flex_attention_backward_564 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2482294Z triton_flex_attention_backward_569 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.2483476Z triton_flex_attention_backward_566 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2483723Z SingleProcess AUTOTUNE benchmarking takes 0.6659 seconds and 2.2862 seconds precompiling for 13 choices 2025-12-04T09:14:59.2483859Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.2483940Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.2484002Z unimplemented [] 2025-12-04T09:14:59.2484121Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.2484307Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.2485478Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.2485548Z graph_break [] 2025-12-04T09:14:59.2485675Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.2485747Z Autotune Choices Stats: 2025-12-04T09:14:59.2487135Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_575", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:59.2487382Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2487683Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2487996Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2489124Z triton_flex_attention_575 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2490296Z triton_flex_attention_574 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2491462Z triton_flex_attention_572 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.2492628Z triton_flex_attention_573 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.2493746Z triton_flex_attention_570 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2494868Z triton_flex_attention_571 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2495111Z SingleProcess AUTOTUNE benchmarking takes 0.2930 seconds and 1.2848 seconds precompiling for 6 choices 2025-12-04T09:14:59.2495194Z Autotune Choices Stats: 2025-12-04T09:14:59.2496626Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_579", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.014271999709308147, "best_triton_pos": 0} 2025-12-04T09:14:59.2497127Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2497456Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2498020Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2499232Z triton_flex_attention_backward_579 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2500418Z triton_flex_attention_backward_577 0.0143 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2501568Z triton_flex_attention_backward_578 0.0143 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2502724Z triton_flex_attention_backward_576 0.0145 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2503874Z triton_flex_attention_backward_580 0.0154 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2505023Z triton_flex_attention_backward_581 0.0154 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2506245Z triton_flex_attention_backward_583 0.0155 ms 92.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2507426Z triton_flex_attention_backward_582 0.0164 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2508616Z triton_flex_attention_backward_585 0.0164 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2509766Z triton_flex_attention_backward_588 0.0164 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.2510012Z SingleProcess AUTOTUNE benchmarking takes 0.6683 seconds and 2.2637 seconds precompiling for 13 choices 2025-12-04T09:14:59.2510141Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.2510218Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.2510281Z unimplemented [] 2025-12-04T09:14:59.2510383Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.2510578Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.2511747Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.2511817Z graph_break [] 2025-12-04T09:14:59.2511942Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.2512016Z Autotune Choices Stats: 2025-12-04T09:14:59.2513398Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.2513712Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2513926Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2514238Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2515375Z triton_flex_attention_593 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2516522Z triton_flex_attention_594 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2517671Z triton_flex_attention_589 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2518795Z triton_flex_attention_591 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.2519915Z triton_flex_attention_592 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.2521037Z triton_flex_attention_590 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2521283Z SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.3283 seconds precompiling for 6 choices 2025-12-04T09:14:59.2521354Z Autotune Choices Stats: 2025-12-04T09:14:59.2522782Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_595", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:14:59.2523303Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2523629Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2524230Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2525425Z triton_flex_attention_backward_595 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2526582Z triton_flex_attention_backward_596 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2527734Z triton_flex_attention_backward_597 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2528890Z triton_flex_attention_backward_598 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2530052Z triton_flex_attention_backward_600 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2531261Z triton_flex_attention_backward_604 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2532480Z triton_flex_attention_backward_599 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2533663Z triton_flex_attention_backward_601 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2534865Z triton_flex_attention_backward_602 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2536019Z triton_flex_attention_backward_607 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.2536273Z SingleProcess AUTOTUNE benchmarking takes 0.6683 seconds and 2.3815 seconds precompiling for 13 choices 2025-12-04T09:14:59.2536403Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.2536481Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.2536548Z unimplemented [] 2025-12-04T09:14:59.2536650Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.2536843Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.2538016Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.2538086Z graph_break [] 2025-12-04T09:14:59.2538214Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.2538288Z Autotune Choices Stats: 2025-12-04T09:14:59.2539738Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_612", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.2539985Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2540205Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2540519Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2541685Z triton_flex_attention_612 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2542822Z triton_flex_attention_613 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2543945Z triton_flex_attention_610 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.2545057Z triton_flex_attention_608 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2546172Z triton_flex_attention_611 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.2547301Z triton_flex_attention_609 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2547609Z SingleProcess AUTOTUNE benchmarking takes 0.2948 seconds and 1.3336 seconds precompiling for 6 choices 2025-12-04T09:14:59.2547682Z Autotune Choices Stats: 2025-12-04T09:14:59.2549109Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_614", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:14:59.2549543Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2549901Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2550461Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2551671Z triton_flex_attention_backward_614 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2552832Z triton_flex_attention_backward_615 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2553986Z triton_flex_attention_backward_616 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2555146Z triton_flex_attention_backward_617 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2556293Z triton_flex_attention_backward_619 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2557510Z triton_flex_attention_backward_620 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2558694Z triton_flex_attention_backward_621 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2559877Z triton_flex_attention_backward_618 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2561033Z triton_flex_attention_backward_623 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2562359Z triton_flex_attention_backward_626 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.2562622Z SingleProcess AUTOTUNE benchmarking takes 0.6671 seconds and 2.3094 seconds precompiling for 13 choices 2025-12-04T09:14:59.2562755Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.2562832Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.2562896Z unimplemented [] 2025-12-04T09:14:59.2562998Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.2563188Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.2564367Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.2564545Z graph_break [] 2025-12-04T09:14:59.2564677Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.2564747Z Autotune Choices Stats: 2025-12-04T09:14:59.2566147Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_631", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:59.2566401Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2566666Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2566982Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2568158Z triton_flex_attention_631 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2569278Z triton_flex_attention_632 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2570395Z triton_flex_attention_627 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2571510Z triton_flex_attention_629 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.2572681Z triton_flex_attention_630 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.2573885Z triton_flex_attention_628 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2574130Z SingleProcess AUTOTUNE benchmarking takes 0.2939 seconds and 1.3035 seconds precompiling for 6 choices 2025-12-04T09:14:59.2574204Z Autotune Choices Stats: 2025-12-04T09:14:59.2575664Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_634", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:59.2576104Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2576459Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2577023Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2578191Z triton_flex_attention_backward_634 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2579348Z triton_flex_attention_backward_635 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2580501Z triton_flex_attention_backward_636 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2581653Z triton_flex_attention_backward_633 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2582861Z triton_flex_attention_backward_638 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2584070Z triton_flex_attention_backward_640 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2585296Z triton_flex_attention_backward_637 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2586446Z triton_flex_attention_backward_639 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2587603Z triton_flex_attention_backward_642 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2588757Z triton_flex_attention_backward_645 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.2589012Z SingleProcess AUTOTUNE benchmarking takes 0.6697 seconds and 2.2461 seconds precompiling for 13 choices 2025-12-04T09:14:59.2589141Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.2589229Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.2589297Z unimplemented [] 2025-12-04T09:14:59.2589402Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.2589595Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.2590836Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.2590907Z graph_break [] 2025-12-04T09:14:59.2591038Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.2591106Z Autotune Choices Stats: 2025-12-04T09:14:59.2592529Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_650", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:59.2592777Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2593006Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2593350Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2594484Z triton_flex_attention_650 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2595596Z triton_flex_attention_651 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2596720Z triton_flex_attention_648 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.2597837Z triton_flex_attention_646 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2598950Z triton_flex_attention_649 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.2600133Z triton_flex_attention_647 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2600374Z SingleProcess AUTOTUNE benchmarking takes 0.2938 seconds and 1.3188 seconds precompiling for 6 choices 2025-12-04T09:14:59.2600450Z Autotune Choices Stats: 2025-12-04T09:14:59.2601911Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_654", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:59.2602382Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2602719Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2603283Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2604443Z triton_flex_attention_backward_654 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2605609Z triton_flex_attention_backward_652 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2606759Z triton_flex_attention_backward_653 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2607976Z triton_flex_attention_backward_655 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2609128Z triton_flex_attention_backward_657 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2610313Z triton_flex_attention_backward_659 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2611498Z triton_flex_attention_backward_658 0.0156 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2612702Z triton_flex_attention_backward_656 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2613861Z triton_flex_attention_backward_661 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2615011Z triton_flex_attention_backward_664 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.2615273Z SingleProcess AUTOTUNE benchmarking takes 0.6665 seconds and 2.3325 seconds precompiling for 13 choices 2025-12-04T09:14:59.2615488Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.2615565Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.2615628Z unimplemented [] 2025-12-04T09:14:59.2615732Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.2615923Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.2617094Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.2617161Z graph_break [] 2025-12-04T09:14:59.2617308Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.2617379Z Autotune Choices Stats: 2025-12-04T09:14:59.2618814Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_669", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:59.2619088Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2619313Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2619627Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2620765Z triton_flex_attention_669 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2621886Z triton_flex_attention_670 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2623015Z triton_flex_attention_667 0.0123 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.2624142Z triton_flex_attention_665 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2625626Z triton_flex_attention_668 0.0135 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.2626786Z triton_flex_attention_666 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2627031Z SingleProcess AUTOTUNE benchmarking takes 0.2949 seconds and 1.4095 seconds precompiling for 6 choices 2025-12-04T09:14:59.2627103Z Autotune Choices Stats: 2025-12-04T09:14:59.2628560Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_673", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012512000277638435, "best_triton_pos": 0} 2025-12-04T09:14:59.2629002Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2629326Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2629889Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2631058Z triton_flex_attention_backward_673 0.0125 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2632224Z triton_flex_attention_backward_672 0.0133 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2633375Z triton_flex_attention_backward_674 0.0133 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2634604Z triton_flex_attention_backward_671 0.0143 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2635783Z triton_flex_attention_backward_676 0.0143 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2636967Z triton_flex_attention_backward_675 0.0152 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2638126Z triton_flex_attention_backward_677 0.0154 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2639280Z triton_flex_attention_backward_678 0.0154 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2640438Z triton_flex_attention_backward_680 0.0164 ms 76.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2641589Z triton_flex_attention_backward_683 0.0164 ms 76.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.2641902Z SingleProcess AUTOTUNE benchmarking takes 0.6665 seconds and 2.3098 seconds precompiling for 13 choices 2025-12-04T09:14:59.2642032Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.2642108Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.2642172Z unimplemented [] 2025-12-04T09:14:59.2642276Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.2642467Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.2643677Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.2643749Z graph_break [] 2025-12-04T09:14:59.2643876Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.2643945Z Autotune Choices Stats: 2025-12-04T09:14:59.2645364Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_688", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:59.2645613Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2645839Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2646150Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2647285Z triton_flex_attention_688 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2648407Z triton_flex_attention_689 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2649542Z triton_flex_attention_686 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.2650751Z triton_flex_attention_684 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2651879Z triton_flex_attention_687 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.2653088Z triton_flex_attention_685 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2653367Z SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.3382 seconds precompiling for 6 choices 2025-12-04T09:14:59.2653446Z Autotune Choices Stats: 2025-12-04T09:14:59.2654874Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_693", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:59.2655311Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2655636Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2656200Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2657377Z triton_flex_attention_backward_693 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2658550Z triton_flex_attention_backward_692 0.0135 ms 98.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2659767Z triton_flex_attention_backward_691 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2660950Z triton_flex_attention_backward_695 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2662333Z triton_flex_attention_backward_690 0.0153 ms 86.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2663516Z triton_flex_attention_backward_696 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2664691Z triton_flex_attention_backward_697 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2665843Z triton_flex_attention_backward_699 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2667002Z triton_flex_attention_backward_694 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2668242Z triton_flex_attention_backward_702 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.2668498Z SingleProcess AUTOTUNE benchmarking takes 0.6670 seconds and 2.3646 seconds precompiling for 13 choices 2025-12-04T09:14:59.2668674Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:14:59.2668756Z Traceback (most recent call last): 2025-12-04T09:14:59.2669058Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:14:59.2669129Z self.assertTrue( 2025-12-04T09:14:59.2669347Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:14:59.2669477Z raise self.failureException(msg) 2025-12-04T09:14:59.2669733Z AssertionError: False is not true : Log file /tmp/tmpjvws9kal/flex_attention_configs.json was not created 2025-12-04T09:14:59.2669744Z 2025-12-04T09:14:59.2669881Z To execute this test, run the following from the base repo dir: 2025-12-04T09:14:59.2670134Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:14:59.2670138Z 2025-12-04T09:14:59.2670349Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:14:59.2670481Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.2670556Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.2670626Z unimplemented [] 2025-12-04T09:14:59.2670746Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.2671943Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('async_compile_cache_miss', 23), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:14:59.2672130Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.2672193Z graph_break [] 2025-12-04T09:14:59.2672326Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.2673308Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:14:59.2673397Z current_size = base.storage().size() 2025-12-04T09:14:59.2673464Z Autotune Choices Stats: 2025-12-04T09:14:59.2674868Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:59.2675188Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2675418Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2675735Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2676860Z triton_flex_attention_4 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2678004Z triton_flex_attention_5 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2679164Z triton_flex_attention_0 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2680277Z triton_flex_attention_2 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.2681388Z triton_flex_attention_3 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.2682502Z triton_flex_attention_1 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2682772Z SingleProcess AUTOTUNE benchmarking takes 0.2600 seconds and 1.4244 seconds precompiling for 6 choices 2025-12-04T09:14:59.2682841Z Autotune Choices Stats: 2025-12-04T09:14:59.2684265Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:59.2684766Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2685099Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2685658Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2686868Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2688045Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2689203Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2690349Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2691511Z triton_flex_attention_backward_11 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2692716Z triton_flex_attention_backward_12 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2693934Z triton_flex_attention_backward_13 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2695119Z triton_flex_attention_backward_10 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2696296Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2697447Z triton_flex_attention_backward_15 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2697707Z SingleProcess AUTOTUNE benchmarking takes 0.8717 seconds and 2.2315 seconds precompiling for 13 choices 2025-12-04T09:14:59.2697844Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.2697921Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.2697993Z unimplemented [] 2025-12-04T09:14:59.2698100Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.2698290Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.2699489Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.2699562Z graph_break [] 2025-12-04T09:14:59.2699703Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.2699777Z Autotune Choices Stats: 2025-12-04T09:14:59.2701218Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.2701573Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2701859Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2702179Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2703373Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2704630Z triton_flex_attention_24 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2705753Z triton_flex_attention_19 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2706903Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.2708029Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.2709139Z triton_flex_attention_20 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2709390Z SingleProcess AUTOTUNE benchmarking takes 0.2898 seconds and 1.2778 seconds precompiling for 6 choices 2025-12-04T09:14:59.2709526Z Autotune Choices Stats: 2025-12-04T09:14:59.2710958Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.2711428Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2711813Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2712374Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2713582Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2714736Z triton_flex_attention_backward_27 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2715888Z triton_flex_attention_backward_28 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2717044Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2718205Z triton_flex_attention_backward_30 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2719423Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2720658Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2721851Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2723003Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2724150Z triton_flex_attention_backward_37 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.2724405Z SingleProcess AUTOTUNE benchmarking takes 0.6605 seconds and 2.3199 seconds precompiling for 13 choices 2025-12-04T09:14:59.2724534Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.2724608Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.2724679Z unimplemented [] 2025-12-04T09:14:59.2724786Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.2724974Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.2726159Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.2726218Z graph_break [] 2025-12-04T09:14:59.2726353Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.2726491Z Autotune Choices Stats: 2025-12-04T09:14:59.2727885Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.2728128Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2728350Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2728665Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2729830Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2730978Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2732096Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.2733325Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.2734441Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2735548Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2735865Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2616 seconds precompiling for 6 choices 2025-12-04T09:14:59.2735932Z Autotune Choices Stats: 2025-12-04T09:14:59.2737358Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.2737824Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2738155Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2738749Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2739925Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2741083Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2742247Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2743406Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2744634Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2745781Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2746997Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2748186Z triton_flex_attention_backward_50 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2749347Z triton_flex_attention_backward_53 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2750491Z triton_flex_attention_backward_56 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.2750750Z SingleProcess AUTOTUNE benchmarking takes 0.6631 seconds and 2.2635 seconds precompiling for 13 choices 2025-12-04T09:14:59.2750882Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.2750954Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.2751027Z unimplemented [] 2025-12-04T09:14:59.2751129Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.2751313Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.2752500Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.2752625Z graph_break [] 2025-12-04T09:14:59.2752760Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.2752841Z Autotune Choices Stats: 2025-12-04T09:14:59.2754243Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.2754526Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2754755Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2755072Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2756224Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2757348Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2758465Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.2759574Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.2760691Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2762036Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2762310Z SingleProcess AUTOTUNE benchmarking takes 0.2907 seconds and 1.2500 seconds precompiling for 6 choices 2025-12-04T09:14:59.2762382Z Autotune Choices Stats: 2025-12-04T09:14:59.2763896Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.2764339Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2764723Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2765292Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2766472Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2767629Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2768787Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2769952Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2771209Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2772454Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2773634Z triton_flex_attention_backward_69 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2774783Z triton_flex_attention_backward_70 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2775937Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2777091Z triton_flex_attention_backward_75 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.2777348Z SingleProcess AUTOTUNE benchmarking takes 0.6601 seconds and 2.1469 seconds precompiling for 13 choices 2025-12-04T09:14:59.2777485Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.2777559Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.2777630Z unimplemented [] 2025-12-04T09:14:59.2777732Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.2778005Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.2779205Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.2779266Z graph_break [] 2025-12-04T09:14:59.2779403Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.2779470Z Autotune Choices Stats: 2025-12-04T09:14:59.2780890Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.2781151Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2781407Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2781722Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2782847Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2783963Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2785085Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.2786206Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.2787383Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2788486Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2788736Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2576 seconds precompiling for 6 choices 2025-12-04T09:14:59.2788803Z Autotune Choices Stats: 2025-12-04T09:14:59.2790300Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_84", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.2790732Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2791067Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2791624Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2792783Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2793939Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2795092Z triton_flex_attention_backward_83 0.0122 ms 92.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2796297Z triton_flex_attention_backward_82 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2797473Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2798650Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2799800Z triton_flex_attention_backward_88 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2800949Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2802093Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2803241Z triton_flex_attention_backward_91 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2803496Z SingleProcess AUTOTUNE benchmarking takes 0.6599 seconds and 2.2799 seconds precompiling for 13 choices 2025-12-04T09:14:59.2803688Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.2803759Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.2803835Z unimplemented [] 2025-12-04T09:14:59.2803940Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.2804124Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.2805309Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.2805373Z graph_break [] 2025-12-04T09:14:59.2805505Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.2805574Z Autotune Choices Stats: 2025-12-04T09:14:59.2807056Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:59.2807302Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2807523Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2807847Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2808967Z triton_flex_attention_99 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2810092Z triton_flex_attention_100 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2811205Z triton_flex_attention_95 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2812359Z triton_flex_attention_98 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.2813550Z triton_flex_attention_97 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.2814708Z triton_flex_attention_96 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2814964Z SingleProcess AUTOTUNE benchmarking takes 0.2905 seconds and 1.2994 seconds precompiling for 6 choices 2025-12-04T09:14:59.2815032Z Autotune Choices Stats: 2025-12-04T09:14:59.2816500Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_102", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.2816938Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2817268Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2817834Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2819010Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2820162Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2821378Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2822532Z triton_flex_attention_backward_101 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2823714Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2824890Z triton_flex_attention_backward_107 0.0133 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2826041Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2827201Z triton_flex_attention_backward_108 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2828350Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2829498Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2829802Z SingleProcess AUTOTUNE benchmarking takes 0.6615 seconds and 2.2434 seconds precompiling for 13 choices 2025-12-04T09:14:59.2829939Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.2830011Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.2830081Z unimplemented [] 2025-12-04T09:14:59.2830182Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.2830367Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.2831585Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.2831650Z graph_break [] 2025-12-04T09:14:59.2831785Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.2831853Z Autotune Choices Stats: 2025-12-04T09:14:59.2833293Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.2833540Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2833762Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2834086Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2835210Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2836328Z triton_flex_attention_119 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2837452Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.2838626Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.2839792Z triton_flex_attention_114 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2840899Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2841178Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2587 seconds precompiling for 6 choices 2025-12-04T09:14:59.2841245Z Autotune Choices Stats: 2025-12-04T09:14:59.2842670Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:59.2843104Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2843432Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2843994Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2845155Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2846317Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2847544Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2848720Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2849907Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2851064Z triton_flex_attention_backward_124 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2852203Z triton_flex_attention_backward_126 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2853399Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2854544Z triton_flex_attention_backward_129 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2855797Z triton_flex_attention_backward_132 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.2856043Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.3330 seconds precompiling for 13 choices 2025-12-04T09:14:59.2856192Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.2856262Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.2856333Z unimplemented [] 2025-12-04T09:14:59.2856436Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.2856653Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.2857833Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.2857924Z graph_break [] 2025-12-04T09:14:59.2858057Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.2858125Z Autotune Choices Stats: 2025-12-04T09:14:59.2859518Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:59.2859769Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2859987Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2860305Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2861437Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2862710Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2863948Z triton_flex_attention_136 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.2865064Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.2866226Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2867380Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2867635Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3488 seconds precompiling for 6 choices 2025-12-04T09:14:59.2867701Z Autotune Choices Stats: 2025-12-04T09:14:59.2869126Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_140", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.2869561Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2869892Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2870454Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2871623Z triton_flex_attention_backward_140 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2872860Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2874049Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2875226Z triton_flex_attention_backward_142 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2876380Z triton_flex_attention_backward_144 0.0131 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2877531Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2878680Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2879840Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2881048Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2882201Z triton_flex_attention_backward_149 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2882447Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.2392 seconds precompiling for 13 choices 2025-12-04T09:14:59.2882617Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.2882690Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.2882760Z unimplemented [] 2025-12-04T09:14:59.2882861Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.2883043Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.2884255Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.2884320Z graph_break [] 2025-12-04T09:14:59.2884463Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.2884532Z Autotune Choices Stats: 2025-12-04T09:14:59.2885929Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.2886179Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2886402Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2886722Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2887844Z triton_flex_attention_156 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2888964Z triton_flex_attention_157 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2890148Z triton_flex_attention_152 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2891300Z triton_flex_attention_154 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.2892508Z triton_flex_attention_155 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.2893629Z triton_flex_attention_153 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2893877Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3309 seconds precompiling for 6 choices 2025-12-04T09:14:59.2893944Z Autotune Choices Stats: 2025-12-04T09:14:59.2895374Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_160", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.2895810Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2896139Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2896698Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2897930Z triton_flex_attention_backward_160 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2899072Z triton_flex_attention_backward_158 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2900255Z triton_flex_attention_backward_159 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2901436Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2902596Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2903749Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2904900Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2906067Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2907300Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2908483Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.2908729Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.1546 seconds precompiling for 13 choices 2025-12-04T09:14:59.2908866Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.2908939Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.2909046Z unimplemented [] 2025-12-04T09:14:59.2909148Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.2909332Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.2910512Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.2910572Z graph_break [] 2025-12-04T09:14:59.2910707Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.2910773Z Autotune Choices Stats: 2025-12-04T09:14:59.2912157Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.2912406Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2912624Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2912946Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2914067Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2915255Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2916396Z triton_flex_attention_171 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2917548Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.2918666Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.2919786Z triton_flex_attention_172 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2920032Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3199 seconds precompiling for 6 choices 2025-12-04T09:14:59.2920103Z Autotune Choices Stats: 2025-12-04T09:14:59.2921532Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_178", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.2921964Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2922294Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2922913Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2924096Z triton_flex_attention_backward_178 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2925279Z triton_flex_attention_backward_177 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2926460Z triton_flex_attention_backward_180 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2927608Z triton_flex_attention_backward_179 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2928769Z triton_flex_attention_backward_182 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2929912Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2931063Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2932287Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2933519Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2934729Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.2934977Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3449 seconds precompiling for 13 choices 2025-12-04T09:14:59.2935113Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.2935186Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.2935249Z unimplemented [] 2025-12-04T09:14:59.2935361Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.2935544Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.2936720Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.2936783Z graph_break [] 2025-12-04T09:14:59.2936921Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.2936999Z Autotune Choices Stats: 2025-12-04T09:14:59.2938386Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:59.2938633Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2938851Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2939239Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2940363Z triton_flex_attention_194 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2941486Z triton_flex_attention_195 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2942625Z triton_flex_attention_192 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.2943774Z triton_flex_attention_193 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.2944890Z triton_flex_attention_190 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2945996Z triton_flex_attention_191 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2946246Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3094 seconds precompiling for 6 choices 2025-12-04T09:14:59.2946311Z Autotune Choices Stats: 2025-12-04T09:14:59.2947739Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.2948236Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2948575Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2949139Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2950329Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2951520Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2952674Z triton_flex_attention_backward_197 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2953823Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2954981Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2956128Z triton_flex_attention_backward_200 0.0124 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2957348Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2958494Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2959669Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.2960850Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2961098Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.3930 seconds precompiling for 13 choices 2025-12-04T09:14:59.2961233Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.2961301Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.2961366Z unimplemented [] 2025-12-04T09:14:59.2961476Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.2961658Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.2963006Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.2963071Z graph_break [] 2025-12-04T09:14:59.2963198Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.2963271Z Autotune Choices Stats: 2025-12-04T09:14:59.2964663Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008063999935984612, "best_triton_pos": 0} 2025-12-04T09:14:59.2965022Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2965240Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2965559Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2966681Z triton_flex_attention_213 0.0081 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2967871Z triton_flex_attention_214 0.0082 ms 98.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2969033Z triton_flex_attention_211 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.2970155Z triton_flex_attention_212 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.2971268Z triton_flex_attention_209 0.0112 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2972482Z triton_flex_attention_210 0.0133 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2972740Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.3148 seconds precompiling for 6 choices 2025-12-04T09:14:59.2972807Z Autotune Choices Stats: 2025-12-04T09:14:59.2974225Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.2974734Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2975059Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2975656Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2976842Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2978014Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2979172Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2980321Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2981478Z triton_flex_attention_backward_219 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.2982619Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2983837Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.2985023Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.2986194Z triton_flex_attention_backward_224 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2987351Z triton_flex_attention_backward_227 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.2987593Z SingleProcess AUTOTUNE benchmarking takes 0.6619 seconds and 2.3746 seconds precompiling for 13 choices 2025-12-04T09:14:59.2987732Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.2987801Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.2987864Z unimplemented [] 2025-12-04T09:14:59.2987973Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.2988158Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.2989338Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.2989398Z graph_break [] 2025-12-04T09:14:59.2989527Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.2989599Z Autotune Choices Stats: 2025-12-04T09:14:59.2990980Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.2991340Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.2991557Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.2991880Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.2993037Z triton_flex_attention_232 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2994181Z triton_flex_attention_233 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2995291Z triton_flex_attention_228 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2996409Z triton_flex_attention_230 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.2997525Z triton_flex_attention_231 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.2998632Z triton_flex_attention_229 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.2998960Z SingleProcess AUTOTUNE benchmarking takes 0.2900 seconds and 1.2906 seconds precompiling for 6 choices 2025-12-04T09:14:59.2999027Z Autotune Choices Stats: 2025-12-04T09:14:59.3000448Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:59.3000884Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.3001242Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.3001802Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.3002993Z triton_flex_attention_backward_234 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.3004153Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3005301Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.3006447Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.3007600Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3008817Z triton_flex_attention_backward_240 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.3010004Z triton_flex_attention_backward_241 0.0134 ms 91.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.3011191Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.3012393Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3013548Z triton_flex_attention_backward_246 0.0153 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.3013790Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2768 seconds precompiling for 13 choices 2025-12-04T09:14:59.3013933Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.3014008Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.3014072Z unimplemented [] 2025-12-04T09:14:59.3014181Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.3014367Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.3015543Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.3015672Z graph_break [] 2025-12-04T09:14:59.3015800Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.3015874Z Autotune Choices Stats: 2025-12-04T09:14:59.3017259Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.3017507Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.3017728Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.3018092Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.3019252Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3020371Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3021490Z triton_flex_attention_250 0.0111 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.3022624Z triton_flex_attention_247 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3023738Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.3024859Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3025172Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2952 seconds precompiling for 6 choices 2025-12-04T09:14:59.3025248Z Autotune Choices Stats: 2025-12-04T09:14:59.3026701Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_255", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.3027143Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.3027468Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.3028088Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.3029254Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.3030410Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.3031555Z triton_flex_attention_backward_254 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3032705Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.3033930Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3035068Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.3036258Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.3037438Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.3038583Z triton_flex_attention_backward_262 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3039729Z triton_flex_attention_backward_265 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.3039972Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.3139 seconds precompiling for 13 choices 2025-12-04T09:14:59.3040106Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.3040177Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.3040249Z unimplemented [] 2025-12-04T09:14:59.3040359Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.3040544Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.3041723Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.3041854Z graph_break [] 2025-12-04T09:14:59.3041982Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.3042053Z Autotune Choices Stats: 2025-12-04T09:14:59.3043473Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.3043727Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.3043944Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.3044263Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.3045420Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3046556Z triton_flex_attention_271 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3047699Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3048839Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.3049972Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.3051166Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3051412Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T09:14:59.3051497Z Autotune Choices Stats: 2025-12-04T09:14:59.3053029Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:59.3053507Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.3053836Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.3054415Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.3055594Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.3056769Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3057933Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.3059164Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.3060333Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3061533Z triton_flex_attention_backward_278 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.3062925Z triton_flex_attention_backward_279 0.0136 ms 90.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.3064091Z triton_flex_attention_backward_276 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.3065261Z triton_flex_attention_backward_281 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3066417Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.3066669Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.2940 seconds precompiling for 13 choices 2025-12-04T09:14:59.3066803Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.3066881Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.3067286Z unimplemented [] 2025-12-04T09:14:59.3067391Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.3067588Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.3068776Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.3068845Z graph_break [] 2025-12-04T09:14:59.3068975Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.3069042Z Autotune Choices Stats: 2025-12-04T09:14:59.3070509Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.3070757Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.3071017Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.3071335Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.3072485Z triton_flex_attention_289 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3073624Z triton_flex_attention_290 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3074755Z triton_flex_attention_287 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.3075879Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3077076Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.3078207Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3078489Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3003 seconds precompiling for 6 choices 2025-12-04T09:14:59.3078565Z Autotune Choices Stats: 2025-12-04T09:14:59.3080056Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_294", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.012160000391304493, "best_triton_pos": 0} 2025-12-04T09:14:59.3080505Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.3080837Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.3081402Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.3082597Z triton_flex_attention_backward_294 0.0122 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.3083771Z triton_flex_attention_backward_291 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.3084930Z triton_flex_attention_backward_292 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3086162Z triton_flex_attention_backward_293 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.3087354Z triton_flex_attention_backward_296 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3088553Z triton_flex_attention_backward_295 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.3089723Z triton_flex_attention_backward_297 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.3090881Z triton_flex_attention_backward_298 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.3092048Z triton_flex_attention_backward_300 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3093279Z triton_flex_attention_backward_303 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.3093602Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.2523 seconds precompiling for 13 choices 2025-12-04T09:14:59.3093733Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.3093808Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.3093871Z unimplemented [] 2025-12-04T09:14:59.3093973Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.3094167Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.3095353Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.3095423Z graph_break [] 2025-12-04T09:14:59.3095603Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.3095671Z Autotune Choices Stats: 2025-12-04T09:14:59.3097114Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_308", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.3097363Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.3097590Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.3097923Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.3099071Z triton_flex_attention_308 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3100198Z triton_flex_attention_309 0.0082 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3101328Z triton_flex_attention_307 0.0112 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.3102515Z triton_flex_attention_304 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3103644Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.3104806Z triton_flex_attention_305 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3105050Z SingleProcess AUTOTUNE benchmarking takes 0.2924 seconds and 1.4096 seconds precompiling for 6 choices 2025-12-04T09:14:59.3105118Z Autotune Choices Stats: 2025-12-04T09:14:59.3106598Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.3107045Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.3107372Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.3107938Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.3109122Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.3110287Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3111512Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.3112723Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.3113917Z triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3115090Z triton_flex_attention_backward_316 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.3116260Z triton_flex_attention_backward_317 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.3117419Z triton_flex_attention_backward_314 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.3118587Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3119747Z triton_flex_attention_backward_318 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.3120074Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.3225 seconds precompiling for 13 choices 2025-12-04T09:14:59.3120208Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.3120284Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.3120346Z unimplemented [] 2025-12-04T09:14:59.3120447Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.3120637Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.3121855Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.3121923Z graph_break [] 2025-12-04T09:14:59.3122052Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.3122118Z Autotune Choices Stats: 2025-12-04T09:14:59.3123616Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.007968000136315823, "best_triton_pos": 0} 2025-12-04T09:14:59.3123868Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.3124095Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.3124415Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.3125556Z triton_flex_attention_327 0.0080 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3126678Z triton_flex_attention_328 0.0082 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3127812Z triton_flex_attention_325 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.3129012Z triton_flex_attention_326 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.3135491Z triton_flex_attention_323 0.0113 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3136781Z triton_flex_attention_324 0.0133 ms 59.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3137058Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T09:14:59.3137139Z Autotune Choices Stats: 2025-12-04T09:14:59.3138673Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.3139139Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.3139488Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.3140078Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.3141297Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.3142577Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3143769Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.3145008Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.3146236Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3147421Z triton_flex_attention_backward_336 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.3148612Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.3149802Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.3150990Z triton_flex_attention_backward_338 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3152240Z triton_flex_attention_backward_341 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.3152504Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3668 seconds precompiling for 13 choices 2025-12-04T09:14:59.3152657Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.3152733Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.3152801Z unimplemented [] 2025-12-04T09:14:59.3152951Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.3153149Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.3154406Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.3154477Z graph_break [] 2025-12-04T09:14:59.3154613Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.3154694Z Autotune Choices Stats: 2025-12-04T09:14:59.3156211Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.3156468Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.3156689Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.3157031Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.3158209Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3159369Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3160582Z triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3161769Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.3163225Z triton_flex_attention_345 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.3164398Z triton_flex_attention_343 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3164659Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2820 seconds precompiling for 6 choices 2025-12-04T09:14:59.3164738Z Autotune Choices Stats: 2025-12-04T09:14:59.3166224Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_349", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.3166690Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.3167016Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.3167599Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.3168818Z triton_flex_attention_backward_349 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3170133Z triton_flex_attention_backward_348 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.3171371Z triton_flex_attention_backward_350 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.3172675Z triton_flex_attention_backward_351 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.3173873Z triton_flex_attention_backward_353 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3175063Z triton_flex_attention_backward_355 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.3176269Z triton_flex_attention_backward_352 0.0143 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.3177474Z triton_flex_attention_backward_354 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.3178730Z triton_flex_attention_backward_357 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3179962Z triton_flex_attention_backward_360 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.3180222Z SingleProcess AUTOTUNE benchmarking takes 0.6637 seconds and 2.2959 seconds precompiling for 13 choices 2025-12-04T09:14:59.3180367Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.3180440Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.3180505Z unimplemented [] 2025-12-04T09:14:59.3180622Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.3180848Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.3182069Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.3182145Z graph_break [] 2025-12-04T09:14:59.3182276Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.3182353Z Autotune Choices Stats: 2025-12-04T09:14:59.3183801Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.3184060Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.3184279Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.3184606Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.3185778Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3187003Z triton_flex_attention_366 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3188153Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.3189338Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3190518Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.3191681Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3191935Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3270 seconds precompiling for 6 choices 2025-12-04T09:14:59.3192023Z Autotune Choices Stats: 2025-12-04T09:14:59.3193499Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_368", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.3193949Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.3194277Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.3194925Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.3196135Z triton_flex_attention_backward_368 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3197368Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.3198608Z triton_flex_attention_backward_369 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.3199803Z triton_flex_attention_backward_370 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.3200999Z triton_flex_attention_backward_372 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3202193Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.3203392Z triton_flex_attention_backward_374 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.3204645Z triton_flex_attention_backward_371 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.3205841Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3207067Z triton_flex_attention_backward_379 0.0146 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.3207324Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.4637 seconds precompiling for 13 choices 2025-12-04T09:14:59.3207485Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.3207565Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.3207630Z unimplemented [] 2025-12-04T09:14:59.3207741Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.3207935Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.3209147Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.3209215Z graph_break [] 2025-12-04T09:14:59.3209348Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.3209428Z Autotune Choices Stats: 2025-12-04T09:14:59.3210871Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_384", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.3211127Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.3211347Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.3211671Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.3212964Z triton_flex_attention_384 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3214127Z triton_flex_attention_385 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3215315Z triton_flex_attention_380 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3216504Z triton_flex_attention_382 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.3217663Z triton_flex_attention_383 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.3218819Z triton_flex_attention_381 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3219070Z SingleProcess AUTOTUNE benchmarking takes 0.2918 seconds and 1.3242 seconds precompiling for 6 choices 2025-12-04T09:14:59.3219146Z Autotune Choices Stats: 2025-12-04T09:14:59.3220620Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_386", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:59.3221061Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.3221450Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.3222032Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.3223232Z triton_flex_attention_backward_386 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.3224465Z triton_flex_attention_backward_387 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3225689Z triton_flex_attention_backward_388 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.3226887Z triton_flex_attention_backward_389 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.3228088Z triton_flex_attention_backward_391 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3229277Z triton_flex_attention_backward_392 0.0134 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.3230464Z triton_flex_attention_backward_390 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.3231711Z triton_flex_attention_backward_393 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.3232950Z triton_flex_attention_backward_395 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3234172Z triton_flex_attention_backward_394 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.3234433Z SingleProcess AUTOTUNE benchmarking takes 0.6645 seconds and 2.2524 seconds precompiling for 13 choices 2025-12-04T09:14:59.3234574Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.3234653Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.3234718Z unimplemented [] 2025-12-04T09:14:59.3234832Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.3235024Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.3236242Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.3236313Z graph_break [] 2025-12-04T09:14:59.3236444Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.3236526Z Autotune Choices Stats: 2025-12-04T09:14:59.3237969Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_403", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.3238222Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.3238510Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.3238837Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.3240008Z triton_flex_attention_403 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3241197Z triton_flex_attention_404 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3242386Z triton_flex_attention_399 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3243547Z triton_flex_attention_401 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.3244704Z triton_flex_attention_402 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.3245865Z triton_flex_attention_400 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3246122Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3374 seconds precompiling for 6 choices 2025-12-04T09:14:59.3246199Z Autotune Choices Stats: 2025-12-04T09:14:59.3247669Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_406", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.3248176Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.3248508Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.3249091Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.3250329Z triton_flex_attention_backward_406 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3251558Z triton_flex_attention_backward_405 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.3252789Z triton_flex_attention_backward_407 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.3253982Z triton_flex_attention_backward_408 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.3255185Z triton_flex_attention_backward_410 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3256372Z triton_flex_attention_backward_409 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.3257633Z triton_flex_attention_backward_411 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.3258858Z triton_flex_attention_backward_412 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.3260084Z triton_flex_attention_backward_413 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.3261277Z triton_flex_attention_backward_414 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3261533Z SingleProcess AUTOTUNE benchmarking takes 0.6644 seconds and 2.4031 seconds precompiling for 13 choices 2025-12-04T09:14:59.3261664Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.3261744Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.3261955Z unimplemented [] 2025-12-04T09:14:59.3262071Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.3262271Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.3263488Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.3263559Z graph_break [] 2025-12-04T09:14:59.3263689Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.3263765Z Autotune Choices Stats: 2025-12-04T09:14:59.3265205Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_422", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.3265593Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.3265813Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.3266140Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.3267437Z triton_flex_attention_422 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3268636Z triton_flex_attention_423 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3269799Z triton_flex_attention_420 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.3270967Z triton_flex_attention_421 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.3272118Z triton_flex_attention_418 0.0123 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3273274Z triton_flex_attention_419 0.0142 ms 57.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3273526Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3073 seconds precompiling for 6 choices 2025-12-04T09:14:59.3273604Z Autotune Choices Stats: 2025-12-04T09:14:59.3275143Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_426", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012256000190973282, "best_triton_pos": 0} 2025-12-04T09:14:59.3275587Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.3275914Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.3276526Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.3277766Z triton_flex_attention_backward_426 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.3278966Z triton_flex_attention_backward_424 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.3280153Z triton_flex_attention_backward_425 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3281345Z triton_flex_attention_backward_427 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.3282540Z triton_flex_attention_backward_429 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3283788Z triton_flex_attention_backward_431 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.3284986Z triton_flex_attention_backward_430 0.0141 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.3286206Z triton_flex_attention_backward_428 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.3287428Z triton_flex_attention_backward_433 0.0143 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3288616Z triton_flex_attention_backward_436 0.0144 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.3288873Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.3392 seconds precompiling for 13 choices 2025-12-04T09:14:59.3289003Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.3289083Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.3289149Z unimplemented [] 2025-12-04T09:14:59.3289254Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.3289451Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.3290664Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.3290733Z graph_break [] 2025-12-04T09:14:59.3290862Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.3290937Z Autotune Choices Stats: 2025-12-04T09:14:59.3292499Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_441", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.3292758Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.3292985Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.3293313Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.3294524Z triton_flex_attention_441 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3295699Z triton_flex_attention_442 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3296863Z triton_flex_attention_439 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.3298018Z triton_flex_attention_437 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3299168Z triton_flex_attention_440 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.3300324Z triton_flex_attention_438 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3300658Z SingleProcess AUTOTUNE benchmarking takes 0.2906 seconds and 1.4377 seconds precompiling for 6 choices 2025-12-04T09:14:59.3300732Z Autotune Choices Stats: 2025-12-04T09:14:59.3302207Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_445", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.3302682Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.3303011Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.3303621Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.3304819Z triton_flex_attention_backward_445 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.3306012Z triton_flex_attention_backward_446 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.3307202Z triton_flex_attention_backward_443 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.3308395Z triton_flex_attention_backward_444 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3309590Z triton_flex_attention_backward_447 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.3310844Z triton_flex_attention_backward_448 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3312071Z triton_flex_attention_backward_450 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.3313299Z triton_flex_attention_backward_449 0.0135 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.3314503Z triton_flex_attention_backward_452 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3315701Z triton_flex_attention_backward_455 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.3315965Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.3278 seconds precompiling for 13 choices 2025-12-04T09:14:59.3316097Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.3316176Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.3316241Z unimplemented [] 2025-12-04T09:14:59.3316353Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.3316552Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.3317766Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.3317903Z graph_break [] 2025-12-04T09:14:59.3318031Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.3318100Z Autotune Choices Stats: 2025-12-04T09:14:59.3319552Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_460", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:59.3319807Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.3320057Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.3320384Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.3321584Z triton_flex_attention_460 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3322741Z triton_flex_attention_461 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3323904Z triton_flex_attention_459 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.3325058Z triton_flex_attention_456 0.0114 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3326209Z triton_flex_attention_458 0.0123 ms 75.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.3327429Z triton_flex_attention_457 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3327679Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.4513 seconds precompiling for 6 choices 2025-12-04T09:14:59.3327758Z Autotune Choices Stats: 2025-12-04T09:14:59.3329262Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_463", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.3329710Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.3330096Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.3330680Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.3331888Z triton_flex_attention_backward_463 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3333144Z triton_flex_attention_backward_465 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.3334334Z triton_flex_attention_backward_462 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.3335534Z triton_flex_attention_backward_464 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.3336782Z triton_flex_attention_backward_467 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3337998Z triton_flex_attention_backward_468 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.3339220Z triton_flex_attention_backward_469 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.3340410Z triton_flex_attention_backward_466 0.0133 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.3341603Z triton_flex_attention_backward_471 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3342791Z triton_flex_attention_backward_470 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.3343046Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.2662 seconds precompiling for 13 choices 2025-12-04T09:14:59.3343177Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.3343255Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.3343321Z unimplemented [] 2025-12-04T09:14:59.3343424Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.3343629Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.3344916Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.3344984Z graph_break [] 2025-12-04T09:14:59.3345114Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.3345182Z Autotune Choices Stats: 2025-12-04T09:14:59.3346653Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_479", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.007968000136315823, "best_triton_pos": 0} 2025-12-04T09:14:59.3346899Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.3347124Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.3347477Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.3348644Z triton_flex_attention_479 0.0080 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3349795Z triton_flex_attention_480 0.0082 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3350955Z triton_flex_attention_477 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.3352116Z triton_flex_attention_478 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.3353265Z triton_flex_attention_475 0.0113 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3354480Z triton_flex_attention_476 0.0133 ms 59.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3354744Z SingleProcess AUTOTUNE benchmarking takes 0.2917 seconds and 1.3266 seconds precompiling for 6 choices 2025-12-04T09:14:59.3354821Z Autotune Choices Stats: 2025-12-04T09:14:59.3356361Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_481", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.3356802Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.3357134Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.3357718Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.3358929Z triton_flex_attention_backward_481 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.3360137Z triton_flex_attention_backward_482 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3361330Z triton_flex_attention_backward_483 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.3362795Z triton_flex_attention_backward_484 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.3364074Z triton_flex_attention_backward_485 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.3365331Z triton_flex_attention_backward_486 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3366524Z triton_flex_attention_backward_487 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.3367713Z triton_flex_attention_backward_488 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.3368906Z triton_flex_attention_backward_490 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3370093Z triton_flex_attention_backward_493 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.3370355Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2475 seconds precompiling for 13 choices 2025-12-04T09:14:59.3370580Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.3370665Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.3370739Z unimplemented [] 2025-12-04T09:14:59.3370847Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.3371046Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.3372271Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.3372389Z graph_break [] 2025-12-04T09:14:59.3372525Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.3372595Z Autotune Choices Stats: 2025-12-04T09:14:59.3374122Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_498", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.3374368Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.3374608Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.3374937Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.3376123Z triton_flex_attention_498 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3377283Z triton_flex_attention_499 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3378443Z triton_flex_attention_494 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3379605Z triton_flex_attention_496 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.3380845Z triton_flex_attention_497 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.3382037Z triton_flex_attention_495 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3382292Z SingleProcess AUTOTUNE benchmarking takes 0.2916 seconds and 1.3152 seconds precompiling for 6 choices 2025-12-04T09:14:59.3382382Z Autotune Choices Stats: 2025-12-04T09:14:59.3383892Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_500", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:59.3384344Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.3384669Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.3385258Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.3386471Z triton_flex_attention_backward_500 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.3387671Z triton_flex_attention_backward_501 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3388922Z triton_flex_attention_backward_502 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.3390124Z triton_flex_attention_backward_503 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.3391349Z triton_flex_attention_backward_505 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3392590Z triton_flex_attention_backward_507 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.3393785Z triton_flex_attention_backward_506 0.0143 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.3394974Z triton_flex_attention_backward_504 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.3396173Z triton_flex_attention_backward_509 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3397366Z triton_flex_attention_backward_512 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.3397709Z SingleProcess AUTOTUNE benchmarking takes 0.6641 seconds and 2.2733 seconds precompiling for 13 choices 2025-12-04T09:14:59.3397844Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.3397924Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.3397990Z unimplemented [] 2025-12-04T09:14:59.3398101Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.3398304Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.3399563Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 27), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 8), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 5), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.3399635Z graph_break [] 2025-12-04T09:14:59.3399768Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.3399848Z Autotune Choices Stats: 2025-12-04T09:14:59.3401336Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_517", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.3401591Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.3401818Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.3402147Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.3403334Z triton_flex_attention_517 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3404488Z triton_flex_attention_518 0.0084 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3405653Z triton_flex_attention_513 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3406921Z triton_flex_attention_515 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.3408075Z triton_flex_attention_516 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.3409263Z triton_flex_attention_514 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3409552Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3311 seconds precompiling for 6 choices 2025-12-04T09:14:59.3409628Z Autotune Choices Stats: 2025-12-04T09:14:59.3411109Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_519", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:59.3411556Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.3411885Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.3412512Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.3413738Z triton_flex_attention_backward_519 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.3414946Z triton_flex_attention_backward_520 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3416212Z triton_flex_attention_backward_521 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.3417440Z triton_flex_attention_backward_522 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.3418664Z triton_flex_attention_backward_524 0.0132 ms 93.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3419865Z triton_flex_attention_backward_523 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.3421059Z triton_flex_attention_backward_525 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.3422245Z triton_flex_attention_backward_526 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.3423439Z triton_flex_attention_backward_528 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3424690Z triton_flex_attention_backward_531 0.0153 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.3424950Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.2909 seconds precompiling for 13 choices 2025-12-04T09:14:59.3425082Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.3425160Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.3425226Z unimplemented [] 2025-12-04T09:14:59.3425338Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.3425538Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.3426790Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.3426858Z graph_break [] 2025-12-04T09:14:59.3427032Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.3427105Z Autotune Choices Stats: 2025-12-04T09:14:59.3428565Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_536", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.3428815Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.3429042Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.3429365Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.3430579Z triton_flex_attention_536 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3431760Z triton_flex_attention_537 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3433012Z triton_flex_attention_532 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3434168Z triton_flex_attention_534 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.3435358Z triton_flex_attention_535 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.3436546Z triton_flex_attention_533 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3436800Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.2928 seconds precompiling for 6 choices 2025-12-04T09:14:59.3436869Z Autotune Choices Stats: 2025-12-04T09:14:59.3438358Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_538", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:14:59.3438807Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.3439142Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.3439715Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.3440924Z triton_flex_attention_backward_538 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.3442192Z triton_flex_attention_backward_539 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3443415Z triton_flex_attention_backward_540 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.3444646Z triton_flex_attention_backward_541 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.3445840Z triton_flex_attention_backward_542 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.3447036Z triton_flex_attention_backward_543 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3448231Z triton_flex_attention_backward_545 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.3449468Z triton_flex_attention_backward_544 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.3450731Z triton_flex_attention_backward_547 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3451921Z triton_flex_attention_backward_550 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.3452191Z SingleProcess AUTOTUNE benchmarking takes 0.6681 seconds and 2.3456 seconds precompiling for 13 choices 2025-12-04T09:14:59.3452433Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.3452532Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.3452603Z unimplemented [] 2025-12-04T09:14:59.3452723Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.3452917Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.3454184Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.3454258Z graph_break [] 2025-12-04T09:14:59.3454423Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.3454501Z Autotune Choices Stats: 2025-12-04T09:14:59.3455952Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_555", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:59.3456206Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.3456429Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.3456766Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.3457940Z triton_flex_attention_555 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3459099Z triton_flex_attention_556 0.0112 ms 91.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3460322Z triton_flex_attention_553 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.3461519Z triton_flex_attention_554 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.3462932Z triton_flex_attention_551 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3464099Z triton_flex_attention_552 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3464360Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.3230 seconds precompiling for 6 choices 2025-12-04T09:14:59.3464431Z Autotune Choices Stats: 2025-12-04T09:14:59.3465917Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_559", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:59.3466359Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.3466692Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.3467265Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.3468582Z triton_flex_attention_backward_559 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.3469785Z triton_flex_attention_backward_560 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.3471034Z triton_flex_attention_backward_558 0.0142 ms 93.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3472250Z triton_flex_attention_backward_557 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.3473442Z triton_flex_attention_backward_562 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3474634Z triton_flex_attention_backward_561 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.3475834Z triton_flex_attention_backward_563 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.3477033Z triton_flex_attention_backward_564 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.3478287Z triton_flex_attention_backward_569 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.3479514Z triton_flex_attention_backward_566 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3479769Z SingleProcess AUTOTUNE benchmarking takes 0.6659 seconds and 2.2862 seconds precompiling for 13 choices 2025-12-04T09:14:59.3479906Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.3479976Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.3480041Z unimplemented [] 2025-12-04T09:14:59.3480183Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.3480377Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.3481597Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.3481660Z graph_break [] 2025-12-04T09:14:59.3481795Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.3481863Z Autotune Choices Stats: 2025-12-04T09:14:59.3483303Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_575", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:59.3483559Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.3483777Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.3484107Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.3485276Z triton_flex_attention_575 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3486510Z triton_flex_attention_574 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3487696Z triton_flex_attention_572 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.3488887Z triton_flex_attention_573 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.3490048Z triton_flex_attention_570 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3491205Z triton_flex_attention_571 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3491460Z SingleProcess AUTOTUNE benchmarking takes 0.2930 seconds and 1.2848 seconds precompiling for 6 choices 2025-12-04T09:14:59.3491533Z Autotune Choices Stats: 2025-12-04T09:14:59.3493079Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_579", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.014271999709308147, "best_triton_pos": 0} 2025-12-04T09:14:59.3493526Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.3493869Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.3494563Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.3495781Z triton_flex_attention_backward_579 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.3497038Z triton_flex_attention_backward_577 0.0143 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3498271Z triton_flex_attention_backward_578 0.0143 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.3501470Z triton_flex_attention_backward_576 0.0145 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.3502692Z triton_flex_attention_backward_580 0.0154 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.3503908Z triton_flex_attention_backward_581 0.0154 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3505097Z triton_flex_attention_backward_583 0.0155 ms 92.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.3506340Z triton_flex_attention_backward_582 0.0164 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.3507572Z triton_flex_attention_backward_585 0.0164 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3508792Z triton_flex_attention_backward_588 0.0164 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.3509057Z SingleProcess AUTOTUNE benchmarking takes 0.6683 seconds and 2.2637 seconds precompiling for 13 choices 2025-12-04T09:14:59.3509196Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.3509280Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.3509426Z unimplemented [] 2025-12-04T09:14:59.3509537Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.3509745Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.3510966Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.3511035Z graph_break [] 2025-12-04T09:14:59.3511169Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.3511242Z Autotune Choices Stats: 2025-12-04T09:14:59.3512707Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.3512978Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.3513201Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.3513569Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.3514749Z triton_flex_attention_593 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3515899Z triton_flex_attention_594 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3517083Z triton_flex_attention_589 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3518266Z triton_flex_attention_591 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.3519479Z triton_flex_attention_592 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.3520643Z triton_flex_attention_590 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3520898Z SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.3283 seconds precompiling for 6 choices 2025-12-04T09:14:59.3520974Z Autotune Choices Stats: 2025-12-04T09:14:59.3522459Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_595", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:14:59.3522942Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.3523266Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.3523852Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.3525080Z triton_flex_attention_backward_595 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.3526311Z triton_flex_attention_backward_596 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3527512Z triton_flex_attention_backward_597 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.3528738Z triton_flex_attention_backward_598 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.3529921Z triton_flex_attention_backward_600 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3531113Z triton_flex_attention_backward_604 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3532786Z triton_flex_attention_backward_599 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.3533976Z triton_flex_attention_backward_601 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.3535206Z triton_flex_attention_backward_602 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.3536423Z triton_flex_attention_backward_607 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.3536722Z SingleProcess AUTOTUNE benchmarking takes 0.6683 seconds and 2.3815 seconds precompiling for 13 choices 2025-12-04T09:14:59.3536855Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.3536931Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.3536993Z unimplemented [] 2025-12-04T09:14:59.3537099Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.3537299Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.3538515Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.3538584Z graph_break [] 2025-12-04T09:14:59.3538712Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.3538780Z Autotune Choices Stats: 2025-12-04T09:14:59.3540239Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_612", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.3540526Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.3540756Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.3541084Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.3542273Z triton_flex_attention_612 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3543452Z triton_flex_attention_613 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3544642Z triton_flex_attention_610 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.3545870Z triton_flex_attention_608 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3547025Z triton_flex_attention_611 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.3548180Z triton_flex_attention_609 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3548431Z SingleProcess AUTOTUNE benchmarking takes 0.2948 seconds and 1.3336 seconds precompiling for 6 choices 2025-12-04T09:14:59.3548508Z Autotune Choices Stats: 2025-12-04T09:14:59.3549993Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_614", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:14:59.3550475Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.3550802Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.3551381Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.3552616Z triton_flex_attention_backward_614 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.3553843Z triton_flex_attention_backward_615 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3555065Z triton_flex_attention_backward_616 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.3556261Z triton_flex_attention_backward_617 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.3557457Z triton_flex_attention_backward_619 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3558652Z triton_flex_attention_backward_620 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.3559883Z triton_flex_attention_backward_621 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.3561111Z triton_flex_attention_backward_618 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.3562548Z triton_flex_attention_backward_623 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3563782Z triton_flex_attention_backward_626 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.3564122Z SingleProcess AUTOTUNE benchmarking takes 0.6671 seconds and 2.3094 seconds precompiling for 13 choices 2025-12-04T09:14:59.3564259Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.3564338Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.3564401Z unimplemented [] 2025-12-04T09:14:59.3564507Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.3564705Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.3565920Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.3565999Z graph_break [] 2025-12-04T09:14:59.3566134Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.3566205Z Autotune Choices Stats: 2025-12-04T09:14:59.3567657Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_631", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:59.3567954Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.3568180Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.3568506Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.3569746Z triton_flex_attention_631 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3571218Z triton_flex_attention_632 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3572761Z triton_flex_attention_627 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3574006Z triton_flex_attention_629 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.3575161Z triton_flex_attention_630 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.3576325Z triton_flex_attention_628 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3576624Z SingleProcess AUTOTUNE benchmarking takes 0.2939 seconds and 1.3035 seconds precompiling for 6 choices 2025-12-04T09:14:59.3576705Z Autotune Choices Stats: 2025-12-04T09:14:59.3578181Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_634", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:59.3578629Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.3579168Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.3580070Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.3581329Z triton_flex_attention_backward_634 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3582571Z triton_flex_attention_backward_635 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.3583769Z triton_flex_attention_backward_636 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.3584975Z triton_flex_attention_backward_633 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.3587463Z triton_flex_attention_backward_638 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3589990Z triton_flex_attention_backward_640 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.3592485Z triton_flex_attention_backward_637 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.3594986Z triton_flex_attention_backward_639 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.3597432Z triton_flex_attention_backward_642 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3599928Z triton_flex_attention_backward_645 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.3601449Z SingleProcess AUTOTUNE benchmarking takes 0.6697 seconds and 2.2461 seconds precompiling for 13 choices 2025-12-04T09:14:59.3601938Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.3602237Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.3602435Z unimplemented [] 2025-12-04T09:14:59.3602651Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.3603036Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.3604818Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.3606250Z graph_break [] 2025-12-04T09:14:59.3606496Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.3606786Z Autotune Choices Stats: 2025-12-04T09:14:59.3608371Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_650", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:59.3610113Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.3610655Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.3611323Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.3613091Z triton_flex_attention_650 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3615484Z triton_flex_attention_651 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3617895Z triton_flex_attention_648 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.3620276Z triton_flex_attention_646 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3622645Z triton_flex_attention_649 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.3625022Z triton_flex_attention_647 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3626539Z SingleProcess AUTOTUNE benchmarking takes 0.2938 seconds and 1.3188 seconds precompiling for 6 choices 2025-12-04T09:14:59.3626953Z Autotune Choices Stats: 2025-12-04T09:14:59.3628549Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_654", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:59.3630711Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.3631546Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.3632566Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.3634423Z triton_flex_attention_backward_654 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.3636928Z triton_flex_attention_backward_652 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.3639389Z triton_flex_attention_backward_653 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3641839Z triton_flex_attention_backward_655 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.3644326Z triton_flex_attention_backward_657 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3646766Z triton_flex_attention_backward_659 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.3649264Z triton_flex_attention_backward_658 0.0156 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.3651734Z triton_flex_attention_backward_656 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.3654282Z triton_flex_attention_backward_661 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3656732Z triton_flex_attention_backward_664 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.3658240Z SingleProcess AUTOTUNE benchmarking takes 0.6665 seconds and 2.3325 seconds precompiling for 13 choices 2025-12-04T09:14:59.3658705Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.3658992Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.3659195Z unimplemented [] 2025-12-04T09:14:59.3659405Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.3659781Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.3661291Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.3662891Z graph_break [] 2025-12-04T09:14:59.3663116Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.3663403Z Autotune Choices Stats: 2025-12-04T09:14:59.3664978Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_669", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:59.3666810Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.3667359Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.3667975Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.3670005Z triton_flex_attention_669 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3674419Z triton_flex_attention_670 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3678723Z triton_flex_attention_667 0.0123 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.3683044Z triton_flex_attention_665 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3687335Z triton_flex_attention_668 0.0135 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.3691726Z triton_flex_attention_666 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3694522Z SingleProcess AUTOTUNE benchmarking takes 0.2949 seconds and 1.4095 seconds precompiling for 6 choices 2025-12-04T09:14:59.3695247Z Autotune Choices Stats: 2025-12-04T09:14:59.3698187Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_673", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012512000277638435, "best_triton_pos": 0} 2025-12-04T09:14:59.3701825Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.3703381Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.3705234Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.3708606Z triton_flex_attention_backward_673 0.0125 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.3713081Z triton_flex_attention_backward_672 0.0133 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3717521Z triton_flex_attention_backward_674 0.0133 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.3721966Z triton_flex_attention_backward_671 0.0143 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.3726474Z triton_flex_attention_backward_676 0.0143 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3731070Z triton_flex_attention_backward_675 0.0152 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.3735638Z triton_flex_attention_backward_677 0.0154 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.3740184Z triton_flex_attention_backward_678 0.0154 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.3744625Z triton_flex_attention_backward_680 0.0164 ms 76.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3749080Z triton_flex_attention_backward_683 0.0164 ms 76.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.3751823Z SingleProcess AUTOTUNE benchmarking takes 0.6665 seconds and 2.3098 seconds precompiling for 13 choices 2025-12-04T09:14:59.3752665Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.3753164Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.3753572Z unimplemented [] 2025-12-04T09:14:59.3753937Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.3754598Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.3757328Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.3759813Z graph_break [] 2025-12-04T09:14:59.3760212Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.3760713Z Autotune Choices Stats: 2025-12-04T09:14:59.3763784Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_688", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:59.3766950Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.3767997Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.3769123Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.3772042Z triton_flex_attention_688 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3776438Z triton_flex_attention_689 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3780742Z triton_flex_attention_686 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.3785039Z triton_flex_attention_684 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3789429Z triton_flex_attention_687 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.3793730Z triton_flex_attention_685 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3796449Z SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.3382 seconds precompiling for 6 choices 2025-12-04T09:14:59.3797161Z Autotune Choices Stats: 2025-12-04T09:14:59.3800050Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_693", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:59.3803634Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.3805223Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.3807019Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.3810394Z triton_flex_attention_backward_693 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.3814915Z triton_flex_attention_backward_692 0.0135 ms 98.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.3819362Z triton_flex_attention_backward_691 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3823886Z triton_flex_attention_backward_695 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3828379Z triton_flex_attention_backward_690 0.0153 ms 86.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.3832871Z triton_flex_attention_backward_696 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.3837311Z triton_flex_attention_backward_697 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.3841800Z triton_flex_attention_backward_699 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3846246Z triton_flex_attention_backward_694 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.3850689Z triton_flex_attention_backward_702 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.3853521Z SingleProcess AUTOTUNE benchmarking takes 0.6670 seconds and 2.3646 seconds precompiling for 13 choices 2025-12-04T09:14:59.3854360Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.3854862Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.3855195Z unimplemented [] 2025-12-04T09:14:59.3855557Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.3856214Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.3858937Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.3861403Z graph_break [] 2025-12-04T09:14:59.3861995Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.3862507Z Autotune Choices Stats: 2025-12-04T09:14:59.3865347Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_707", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:59.3868495Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.3869540Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.3870654Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.3873493Z triton_flex_attention_707 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3877791Z triton_flex_attention_708 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3882087Z triton_flex_attention_703 0.0134 ms 76.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3886463Z triton_flex_attention_705 0.0135 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.3890771Z triton_flex_attention_706 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.3895219Z triton_flex_attention_704 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3897891Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3640 seconds precompiling for 6 choices 2025-12-04T09:14:59.3898601Z Autotune Choices Stats: 2025-12-04T09:14:59.3901551Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_712", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:59.3905166Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.3906714Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.3908498Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.3911871Z triton_flex_attention_backward_712 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.3916315Z triton_flex_attention_backward_710 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3920817Z triton_flex_attention_backward_711 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.3925290Z triton_flex_attention_backward_709 0.0153 ms 86.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.3929795Z triton_flex_attention_backward_714 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3934282Z triton_flex_attention_backward_716 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.3938813Z triton_flex_attention_backward_713 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.3943246Z triton_flex_attention_backward_715 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.3947680Z triton_flex_attention_backward_718 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3952119Z triton_flex_attention_backward_721 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.3954916Z SingleProcess AUTOTUNE benchmarking takes 0.6684 seconds and 2.3297 seconds precompiling for 13 choices 2025-12-04T09:14:59.3955840Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:14:59.3956450Z Traceback (most recent call last): 2025-12-04T09:14:59.3957246Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:14:59.3958050Z self.assertTrue( 2025-12-04T09:14:59.3958583Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:14:59.3959229Z raise self.failureException(msg) 2025-12-04T09:14:59.3959986Z AssertionError: False is not true : Log file /tmp/tmpmpfvm_pd/flex_attention_configs.json was not created 2025-12-04T09:14:59.3960572Z 2025-12-04T09:14:59.3960807Z To execute this test, run the following from the base repo dir: 2025-12-04T09:14:59.3961658Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:14:59.3962445Z 2025-12-04T09:14:59.3962740Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:14:59.3963412Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.3963989Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.3964345Z unimplemented [] 2025-12-04T09:14:59.3964711Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.3967281Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('async_compile_cache_miss', 23), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:14:59.3970061Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.3970652Z graph_break [] 2025-12-04T09:14:59.3971053Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.3973318Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:14:59.3975433Z current_size = base.storage().size() 2025-12-04T09:14:59.3975818Z Autotune Choices Stats: 2025-12-04T09:14:59.3978612Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:59.3981754Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.3982819Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.3983942Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.3986770Z triton_flex_attention_4 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3991129Z triton_flex_attention_5 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3995444Z triton_flex_attention_0 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.3999739Z triton_flex_attention_2 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.4004077Z triton_flex_attention_3 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.4008356Z triton_flex_attention_1 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4011018Z SingleProcess AUTOTUNE benchmarking takes 0.2600 seconds and 1.4244 seconds precompiling for 6 choices 2025-12-04T09:14:59.4011732Z Autotune Choices Stats: 2025-12-04T09:14:59.4014639Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:59.4018264Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.4019818Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.4021603Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.4025022Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4029515Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.4034029Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.4038454Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.4042881Z triton_flex_attention_backward_11 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4047305Z triton_flex_attention_backward_12 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.4051792Z triton_flex_attention_backward_13 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.4056332Z triton_flex_attention_backward_10 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.4060857Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.4065423Z triton_flex_attention_backward_15 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4068235Z SingleProcess AUTOTUNE benchmarking takes 0.8717 seconds and 2.2315 seconds precompiling for 13 choices 2025-12-04T09:14:59.4069072Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.4069576Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.4069924Z unimplemented [] 2025-12-04T09:14:59.4070285Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.4070937Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.4073659Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.4076132Z graph_break [] 2025-12-04T09:14:59.4076531Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.4077029Z Autotune Choices Stats: 2025-12-04T09:14:59.4079817Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.4083046Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.4084031Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.4085161Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.4087999Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4092456Z triton_flex_attention_24 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4096813Z triton_flex_attention_19 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4101160Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.4105456Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.4109757Z triton_flex_attention_20 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4112431Z SingleProcess AUTOTUNE benchmarking takes 0.2898 seconds and 1.2778 seconds precompiling for 6 choices 2025-12-04T09:14:59.4113149Z Autotune Choices Stats: 2025-12-04T09:14:59.4116011Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.4119671Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.4121213Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.4123063Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.4126486Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4130945Z triton_flex_attention_backward_27 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.4135477Z triton_flex_attention_backward_28 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.4139912Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.4144348Z triton_flex_attention_backward_30 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4148835Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.4153270Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.4157749Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.4162405Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4166906Z triton_flex_attention_backward_37 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.4169633Z SingleProcess AUTOTUNE benchmarking takes 0.6605 seconds and 2.3199 seconds precompiling for 13 choices 2025-12-04T09:14:59.4170468Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.4170972Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.4171313Z unimplemented [] 2025-12-04T09:14:59.4171673Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.4172467Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.4175167Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.4177615Z graph_break [] 2025-12-04T09:14:59.4178007Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.4178508Z Autotune Choices Stats: 2025-12-04T09:14:59.4181271Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.4184481Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.4185472Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.4186578Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.4189472Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4193783Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4198077Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.4202332Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.4206583Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4210842Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4213619Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2616 seconds precompiling for 6 choices 2025-12-04T09:14:59.4214331Z Autotune Choices Stats: 2025-12-04T09:14:59.4217166Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.4220719Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.4222309Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.4224087Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.4227471Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4231935Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.4236332Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.4240732Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.4245118Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4249562Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.4254056Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.4258500Z triton_flex_attention_backward_50 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.4263054Z triton_flex_attention_backward_53 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4267520Z triton_flex_attention_backward_56 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.4270243Z SingleProcess AUTOTUNE benchmarking takes 0.6631 seconds and 2.2635 seconds precompiling for 13 choices 2025-12-04T09:14:59.4271068Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.4271572Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.4271912Z unimplemented [] 2025-12-04T09:14:59.4272263Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.4272918Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.4275614Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.4278135Z graph_break [] 2025-12-04T09:14:59.4278528Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.4279027Z Autotune Choices Stats: 2025-12-04T09:14:59.4281776Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.4284888Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.4285943Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.4287069Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.4289948Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4294294Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4298625Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.4302884Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.4307139Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4311461Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4314107Z SingleProcess AUTOTUNE benchmarking takes 0.2907 seconds and 1.2500 seconds precompiling for 6 choices 2025-12-04T09:14:59.4314806Z Autotune Choices Stats: 2025-12-04T09:14:59.4317674Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.4321232Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.4322813Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.4324600Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.4327995Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.4332468Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4336879Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.4341281Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.4345727Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.4350163Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4354615Z triton_flex_attention_backward_69 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.4359005Z triton_flex_attention_backward_70 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.4363587Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4367979Z triton_flex_attention_backward_75 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.4370694Z SingleProcess AUTOTUNE benchmarking takes 0.6601 seconds and 2.1469 seconds precompiling for 13 choices 2025-12-04T09:14:59.4371519Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.4372011Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.4372399Z unimplemented [] 2025-12-04T09:14:59.4372759Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.4373410Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.4376102Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.4378634Z graph_break [] 2025-12-04T09:14:59.4379031Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.4379528Z Autotune Choices Stats: 2025-12-04T09:14:59.4382342Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.4385466Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.4386446Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.4387622Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.4390435Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4394768Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4399023Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.4403284Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.4407538Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4411847Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4414544Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2576 seconds precompiling for 6 choices 2025-12-04T09:14:59.4415254Z Autotune Choices Stats: 2025-12-04T09:14:59.4418160Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_84", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.4421756Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.4423298Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.4425160Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.4428505Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.4451274Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.4455849Z triton_flex_attention_backward_83 0.0122 ms 92.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4460360Z triton_flex_attention_backward_82 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.4464942Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4469425Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.4473902Z triton_flex_attention_backward_88 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.4478357Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.4482730Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.4487105Z triton_flex_attention_backward_91 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4489809Z SingleProcess AUTOTUNE benchmarking takes 0.6599 seconds and 2.2799 seconds precompiling for 13 choices 2025-12-04T09:14:59.4490651Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.4491234Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.4491576Z unimplemented [] 2025-12-04T09:14:59.4491938Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.4492685Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.4495388Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.4497827Z graph_break [] 2025-12-04T09:14:59.4498217Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.4498722Z Autotune Choices Stats: 2025-12-04T09:14:59.4501531Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:59.4504687Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.4505668Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.4506776Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.4509633Z triton_flex_attention_99 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4513882Z triton_flex_attention_100 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4518121Z triton_flex_attention_95 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4522368Z triton_flex_attention_98 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.4526660Z triton_flex_attention_97 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.4530904Z triton_flex_attention_96 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4533652Z SingleProcess AUTOTUNE benchmarking takes 0.2905 seconds and 1.2994 seconds precompiling for 6 choices 2025-12-04T09:14:59.4534366Z Autotune Choices Stats: 2025-12-04T09:14:59.4537234Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_102", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.4540786Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.4542370Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.4544143Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.4547480Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4551896Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.4556300Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.4560748Z triton_flex_attention_backward_101 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.4565370Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4569854Z triton_flex_attention_backward_107 0.0133 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.4574306Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.4578766Z triton_flex_attention_backward_108 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.4583154Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.4587543Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4590336Z SingleProcess AUTOTUNE benchmarking takes 0.6615 seconds and 2.2434 seconds precompiling for 13 choices 2025-12-04T09:14:59.4591157Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.4591658Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.4591998Z unimplemented [] 2025-12-04T09:14:59.4592357Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.4593010Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.4595700Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.4598196Z graph_break [] 2025-12-04T09:14:59.4598593Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.4599081Z Autotune Choices Stats: 2025-12-04T09:14:59.4601877Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.4604991Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.4606014Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.4607137Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.4609934Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4614278Z triton_flex_attention_119 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4618541Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.4622859Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.4627116Z triton_flex_attention_114 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4631419Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4634061Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2587 seconds precompiling for 6 choices 2025-12-04T09:14:59.4634807Z Autotune Choices Stats: 2025-12-04T09:14:59.4637632Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:59.4641217Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.4642754Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.4644525Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.4647866Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.4652266Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4656782Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.4661230Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.4665847Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4670241Z triton_flex_attention_backward_124 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.4674687Z triton_flex_attention_backward_126 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.4679059Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.4683132Z triton_flex_attention_backward_129 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4685591Z triton_flex_attention_backward_132 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.4687064Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.3330 seconds precompiling for 13 choices 2025-12-04T09:14:59.4687534Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.4687819Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.4688011Z unimplemented [] 2025-12-04T09:14:59.4688220Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.4688583Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.4690102Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.4691420Z graph_break [] 2025-12-04T09:14:59.4691651Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.4691962Z Autotune Choices Stats: 2025-12-04T09:14:59.4693576Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:59.4695317Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.4695850Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.4696461Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.4697987Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4700295Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4702606Z triton_flex_attention_136 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.4704941Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.4707267Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4709593Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4711023Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3488 seconds precompiling for 6 choices 2025-12-04T09:14:59.4711446Z Autotune Choices Stats: 2025-12-04T09:14:59.4712985Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_140", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.4714896Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.4715720Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.4716679Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.4718507Z triton_flex_attention_backward_140 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4720948Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.4723323Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.4725727Z triton_flex_attention_backward_142 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.4728124Z triton_flex_attention_backward_144 0.0131 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4730524Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.4732962Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.4735341Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.4737709Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4740135Z triton_flex_attention_backward_149 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.4741603Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.2392 seconds precompiling for 13 choices 2025-12-04T09:14:59.4742065Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.4742351Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.4742582Z unimplemented [] 2025-12-04T09:14:59.4742786Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.4743150Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.4744632Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.4745936Z graph_break [] 2025-12-04T09:14:59.4746164Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.4746480Z Autotune Choices Stats: 2025-12-04T09:14:59.4747994Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.4749705Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.4750248Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.4750868Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.4752398Z triton_flex_attention_156 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4754707Z triton_flex_attention_157 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4757060Z triton_flex_attention_152 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4759401Z triton_flex_attention_154 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.4761736Z triton_flex_attention_155 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.4764374Z triton_flex_attention_153 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4765926Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3309 seconds precompiling for 6 choices 2025-12-04T09:14:59.4766325Z Autotune Choices Stats: 2025-12-04T09:14:59.4767876Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_160", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.4769807Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.4770636Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.4771593Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.4773479Z triton_flex_attention_backward_160 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.4775933Z triton_flex_attention_backward_158 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.4778354Z triton_flex_attention_backward_159 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4780760Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.4783164Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4785534Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.4787911Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.4790285Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.4792691Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4795095Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.4796563Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.1546 seconds precompiling for 13 choices 2025-12-04T09:14:59.4797030Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.4797310Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.4797511Z unimplemented [] 2025-12-04T09:14:59.4797720Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.4798113Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.4799573Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.4800907Z graph_break [] 2025-12-04T09:14:59.4801136Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.4801411Z Autotune Choices Stats: 2025-12-04T09:14:59.4802921Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.4804620Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.4805161Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.4805771Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.4807300Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4808459Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4809584Z triton_flex_attention_171 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4810734Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.4811889Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.4813142Z triton_flex_attention_172 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4813393Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3199 seconds precompiling for 6 choices 2025-12-04T09:14:59.4813462Z Autotune Choices Stats: 2025-12-04T09:14:59.4814895Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_178", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.4815330Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.4815662Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.4816262Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.4817434Z triton_flex_attention_backward_178 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4818621Z triton_flex_attention_backward_177 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.4819808Z triton_flex_attention_backward_180 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.4820967Z triton_flex_attention_backward_179 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.4822150Z triton_flex_attention_backward_182 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4823310Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.4824457Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.4825646Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.4826803Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4827989Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.4828274Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3449 seconds precompiling for 13 choices 2025-12-04T09:14:59.4828410Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.4828486Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.4828548Z unimplemented [] 2025-12-04T09:14:59.4828654Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.4828889Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.4830073Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.4830143Z graph_break [] 2025-12-04T09:14:59.4830273Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.4830341Z Autotune Choices Stats: 2025-12-04T09:14:59.4831733Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:59.4831978Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.4832202Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.4832520Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.4833686Z triton_flex_attention_194 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4834801Z triton_flex_attention_195 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4835954Z triton_flex_attention_192 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.4837102Z triton_flex_attention_193 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.4838256Z triton_flex_attention_190 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4839375Z triton_flex_attention_191 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4839622Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3094 seconds precompiling for 6 choices 2025-12-04T09:14:59.4839692Z Autotune Choices Stats: 2025-12-04T09:14:59.4841120Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.4841550Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.4842504Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.4843064Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.4844273Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.4845480Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.4846641Z triton_flex_attention_backward_197 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4847826Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.4848982Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4850139Z triton_flex_attention_backward_200 0.0124 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.4851287Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.4852538Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.4853729Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.4854906Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.4855162Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.3930 seconds precompiling for 13 choices 2025-12-04T09:14:59.4855327Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.4855397Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.4855467Z unimplemented [] 2025-12-04T09:14:59.4855571Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.4855763Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.4856937Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.4857008Z graph_break [] 2025-12-04T09:14:59.4857139Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.4857208Z Autotune Choices Stats: 2025-12-04T09:14:59.4858612Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008063999935984612, "best_triton_pos": 0} 2025-12-04T09:14:59.4858857Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.4859131Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.4859449Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.4860582Z triton_flex_attention_213 0.0081 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4861728Z triton_flex_attention_214 0.0082 ms 98.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4863216Z triton_flex_attention_211 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.4864363Z triton_flex_attention_212 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.4865552Z triton_flex_attention_209 0.0112 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4866686Z triton_flex_attention_210 0.0133 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4866939Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.3148 seconds precompiling for 6 choices 2025-12-04T09:14:59.4867008Z Autotune Choices Stats: 2025-12-04T09:14:59.4868456Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.4868941Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.4869286Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.4869846Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.4871071Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.4872260Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4873458Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.4874620Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.4875772Z triton_flex_attention_backward_219 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.4876934Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4878117Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.4879328Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.4880518Z triton_flex_attention_backward_224 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4881660Z triton_flex_attention_backward_227 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.4881952Z SingleProcess AUTOTUNE benchmarking takes 0.6619 seconds and 2.3746 seconds precompiling for 13 choices 2025-12-04T09:14:59.4882086Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.4882158Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.4882230Z unimplemented [] 2025-12-04T09:14:59.4882336Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.4882529Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.4883707Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.4883777Z graph_break [] 2025-12-04T09:14:59.4883905Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.4883976Z Autotune Choices Stats: 2025-12-04T09:14:59.4885373Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.4885667Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.4885892Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.4886206Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.4887370Z triton_flex_attention_232 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4888516Z triton_flex_attention_233 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4889641Z triton_flex_attention_228 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4890788Z triton_flex_attention_230 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.4891899Z triton_flex_attention_231 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.4893105Z triton_flex_attention_229 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4893357Z SingleProcess AUTOTUNE benchmarking takes 0.2900 seconds and 1.2906 seconds precompiling for 6 choices 2025-12-04T09:14:59.4893464Z Autotune Choices Stats: 2025-12-04T09:14:59.4894900Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:59.4895334Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.4895663Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.4896252Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.4897449Z triton_flex_attention_backward_234 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.4898605Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4899805Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.4900965Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.4902115Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4903318Z triton_flex_attention_backward_240 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.4904469Z triton_flex_attention_backward_241 0.0134 ms 91.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.4905649Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.4906833Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4908006Z triton_flex_attention_backward_246 0.0153 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.4908259Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2768 seconds precompiling for 13 choices 2025-12-04T09:14:59.4908389Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.4908466Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.4908530Z unimplemented [] 2025-12-04T09:14:59.4908637Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.4908825Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.4909998Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.4910064Z graph_break [] 2025-12-04T09:14:59.4910191Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.4910329Z Autotune Choices Stats: 2025-12-04T09:14:59.4911718Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.4911962Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.4912180Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.4912496Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.4913661Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4914811Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4915976Z triton_flex_attention_250 0.0111 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.4917098Z triton_flex_attention_247 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4918217Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.4919336Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4919614Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2952 seconds precompiling for 6 choices 2025-12-04T09:14:59.4919688Z Autotune Choices Stats: 2025-12-04T09:14:59.4921119Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_255", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.4921603Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.4921936Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.4922537Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.4923698Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.4924917Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.4926066Z triton_flex_attention_backward_254 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4927237Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.4928423Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4929579Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.4930767Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.4931953Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.4933191Z triton_flex_attention_backward_262 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4934345Z triton_flex_attention_backward_265 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.4934603Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.3139 seconds precompiling for 13 choices 2025-12-04T09:14:59.4934734Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.4934815Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.4934878Z unimplemented [] 2025-12-04T09:14:59.4934980Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.4935171Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.4936347Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.4936459Z graph_break [] 2025-12-04T09:14:59.4936588Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.4936654Z Autotune Choices Stats: 2025-12-04T09:14:59.4938048Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.4938293Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.4938555Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.4938875Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.4940041Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4941208Z triton_flex_attention_271 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4942329Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4943458Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.4944574Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.4945747Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4945992Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T09:14:59.4946070Z Autotune Choices Stats: 2025-12-04T09:14:59.4947531Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:59.4947979Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.4948335Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.4948899Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.4950101Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.4951261Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4952411Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.4953566Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.4954751Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4955942Z triton_flex_attention_backward_278 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.4957129Z triton_flex_attention_backward_279 0.0136 ms 90.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.4958276Z triton_flex_attention_backward_276 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.4959471Z triton_flex_attention_backward_281 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4960621Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.4960888Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.2940 seconds precompiling for 13 choices 2025-12-04T09:14:59.4961022Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.4961102Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.4961165Z unimplemented [] 2025-12-04T09:14:59.4961266Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.4961499Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.4962825Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.4962894Z graph_break [] 2025-12-04T09:14:59.4963022Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.4963087Z Autotune Choices Stats: 2025-12-04T09:14:59.4964689Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.4964957Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.4965236Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.4965557Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.4966702Z triton_flex_attention_289 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4967872Z triton_flex_attention_290 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4969000Z triton_flex_attention_287 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.4970130Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4971287Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.4972454Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4972699Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3003 seconds precompiling for 6 choices 2025-12-04T09:14:59.4972773Z Autotune Choices Stats: 2025-12-04T09:14:59.4974290Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_294", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.012160000391304493, "best_triton_pos": 0} 2025-12-04T09:14:59.4974730Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.4975091Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.4975647Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.4976819Z triton_flex_attention_backward_294 0.0122 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.4977983Z triton_flex_attention_backward_291 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.4979142Z triton_flex_attention_backward_292 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4980328Z triton_flex_attention_backward_293 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.4981572Z triton_flex_attention_backward_296 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4982763Z triton_flex_attention_backward_295 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.4983918Z triton_flex_attention_backward_297 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.4985103Z triton_flex_attention_backward_298 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.4986257Z triton_flex_attention_backward_300 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4987409Z triton_flex_attention_backward_303 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.4987661Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.2523 seconds precompiling for 13 choices 2025-12-04T09:14:59.4987826Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.4987903Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.4987968Z unimplemented [] 2025-12-04T09:14:59.4988069Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.4988266Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.4989451Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.4989520Z graph_break [] 2025-12-04T09:14:59.4989650Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.4989720Z Autotune Choices Stats: 2025-12-04T09:14:59.4991186Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_308", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.4991428Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.4991655Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.4992008Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.4993145Z triton_flex_attention_308 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4994260Z triton_flex_attention_309 0.0082 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4995393Z triton_flex_attention_307 0.0112 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.4996506Z triton_flex_attention_304 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4997664Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.4998817Z triton_flex_attention_305 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.4999062Z SingleProcess AUTOTUNE benchmarking takes 0.2924 seconds and 1.4096 seconds precompiling for 6 choices 2025-12-04T09:14:59.4999128Z Autotune Choices Stats: 2025-12-04T09:14:59.5000610Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.5001079Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5001406Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5001976Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5003151Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5004313Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5005501Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5006660Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5007858Z triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5009044Z triton_flex_attention_backward_316 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5010234Z triton_flex_attention_backward_317 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5011386Z triton_flex_attention_backward_314 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5012588Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5013732Z triton_flex_attention_backward_318 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5014019Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.3225 seconds precompiling for 13 choices 2025-12-04T09:14:59.5014148Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.5014222Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.5014283Z unimplemented [] 2025-12-04T09:14:59.5014383Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.5014572Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.5015795Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.5015864Z graph_break [] 2025-12-04T09:14:59.5015991Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.5016055Z Autotune Choices Stats: 2025-12-04T09:14:59.5017482Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.007968000136315823, "best_triton_pos": 0} 2025-12-04T09:14:59.5017757Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5017989Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5018302Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5019444Z triton_flex_attention_327 0.0080 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5020562Z triton_flex_attention_328 0.0082 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5021682Z triton_flex_attention_325 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.5022834Z triton_flex_attention_326 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.5023989Z triton_flex_attention_323 0.0113 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5025139Z triton_flex_attention_324 0.0133 ms 59.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5025379Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T09:14:59.5025444Z Autotune Choices Stats: 2025-12-04T09:14:59.5026876Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.5027351Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5027673Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5028240Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5029415Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5030578Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5031765Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5032951Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5034131Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5035283Z triton_flex_attention_backward_336 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5036470Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5037616Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5038771Z triton_flex_attention_backward_338 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5039981Z triton_flex_attention_backward_341 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.5040233Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3668 seconds precompiling for 13 choices 2025-12-04T09:14:59.5040362Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.5040436Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.5040499Z unimplemented [] 2025-12-04T09:14:59.5040598Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.5040823Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.5042031Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.5042098Z graph_break [] 2025-12-04T09:14:59.5042224Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.5042287Z Autotune Choices Stats: 2025-12-04T09:14:59.5043684Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.5043960Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5044182Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5044496Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5045624Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5046740Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5047899Z triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5049012Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.5050162Z triton_flex_attention_345 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.5051314Z triton_flex_attention_343 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5051592Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2820 seconds precompiling for 6 choices 2025-12-04T09:14:59.5051659Z Autotune Choices Stats: 2025-12-04T09:14:59.5053132Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_349", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.5053563Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5053894Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5054451Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5055620Z triton_flex_attention_backward_349 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5056825Z triton_flex_attention_backward_348 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5058014Z triton_flex_attention_backward_350 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5059212Z triton_flex_attention_backward_351 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5060363Z triton_flex_attention_backward_353 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5061560Z triton_flex_attention_backward_355 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5062901Z triton_flex_attention_backward_352 0.0143 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5064086Z triton_flex_attention_backward_354 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5065327Z triton_flex_attention_backward_357 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5066489Z triton_flex_attention_backward_360 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.5066747Z SingleProcess AUTOTUNE benchmarking takes 0.6637 seconds and 2.2959 seconds precompiling for 13 choices 2025-12-04T09:14:59.5066925Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.5067002Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.5067065Z unimplemented [] 2025-12-04T09:14:59.5067172Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.5067363Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.5068593Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.5068726Z graph_break [] 2025-12-04T09:14:59.5068857Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.5068924Z Autotune Choices Stats: 2025-12-04T09:14:59.5070329Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.5070571Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5070811Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5071130Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5072261Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5073415Z triton_flex_attention_366 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5074538Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.5075680Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5076832Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.5077951Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5078230Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3270 seconds precompiling for 6 choices 2025-12-04T09:14:59.5078296Z Autotune Choices Stats: 2025-12-04T09:14:59.5079740Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_368", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.5080177Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5080506Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5081064Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5082267Z triton_flex_attention_backward_368 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5083420Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5084615Z triton_flex_attention_backward_369 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5085796Z triton_flex_attention_backward_370 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5086990Z triton_flex_attention_backward_372 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5088149Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5089300Z triton_flex_attention_backward_374 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5090448Z triton_flex_attention_backward_371 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5091647Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5092901Z triton_flex_attention_backward_379 0.0146 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.5093161Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.4637 seconds precompiling for 13 choices 2025-12-04T09:14:59.5093291Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.5093395Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.5093465Z unimplemented [] 2025-12-04T09:14:59.5093569Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.5093759Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.5094935Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.5095037Z graph_break [] 2025-12-04T09:14:59.5095162Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.5095227Z Autotune Choices Stats: 2025-12-04T09:14:59.5096621Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_384", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.5096865Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5097093Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5097416Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5098547Z triton_flex_attention_384 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5099700Z triton_flex_attention_385 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5100849Z triton_flex_attention_380 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5102009Z triton_flex_attention_382 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.5103127Z triton_flex_attention_383 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.5104284Z triton_flex_attention_381 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5104525Z SingleProcess AUTOTUNE benchmarking takes 0.2918 seconds and 1.3242 seconds precompiling for 6 choices 2025-12-04T09:14:59.5104593Z Autotune Choices Stats: 2025-12-04T09:14:59.5106026Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_386", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:59.5106457Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5106786Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5107384Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5108556Z triton_flex_attention_backward_386 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5109740Z triton_flex_attention_backward_387 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5110929Z triton_flex_attention_backward_388 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5112125Z triton_flex_attention_backward_389 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5113275Z triton_flex_attention_backward_391 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5114428Z triton_flex_attention_backward_392 0.0134 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5115572Z triton_flex_attention_backward_390 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5116816Z triton_flex_attention_backward_393 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5118001Z triton_flex_attention_backward_395 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5119180Z triton_flex_attention_backward_394 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5119429Z SingleProcess AUTOTUNE benchmarking takes 0.6645 seconds and 2.2524 seconds precompiling for 13 choices 2025-12-04T09:14:59.5119560Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.5119663Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.5119732Z unimplemented [] 2025-12-04T09:14:59.5119834Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.5120039Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.5121211Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.5121281Z graph_break [] 2025-12-04T09:14:59.5121410Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.5121478Z Autotune Choices Stats: 2025-12-04T09:14:59.5122876Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_403", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.5123120Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5123344Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5123692Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5124824Z triton_flex_attention_403 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5125976Z triton_flex_attention_404 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5127095Z triton_flex_attention_399 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5128236Z triton_flex_attention_401 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.5129398Z triton_flex_attention_402 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.5130510Z triton_flex_attention_400 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5130762Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3374 seconds precompiling for 6 choices 2025-12-04T09:14:59.5130827Z Autotune Choices Stats: 2025-12-04T09:14:59.5132260Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_406", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.5132771Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5133100Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5133660Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5134875Z triton_flex_attention_backward_406 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5136065Z triton_flex_attention_backward_405 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5137217Z triton_flex_attention_backward_407 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5138410Z triton_flex_attention_backward_408 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5139567Z triton_flex_attention_backward_410 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5140714Z triton_flex_attention_backward_409 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5141886Z triton_flex_attention_backward_411 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5143036Z triton_flex_attention_backward_412 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5144230Z triton_flex_attention_backward_413 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5145403Z triton_flex_attention_backward_414 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5145687Z SingleProcess AUTOTUNE benchmarking takes 0.6644 seconds and 2.4031 seconds precompiling for 13 choices 2025-12-04T09:14:59.5145816Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.5145884Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.5145951Z unimplemented [] 2025-12-04T09:14:59.5146050Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.5146238Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.5147414Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.5147475Z graph_break [] 2025-12-04T09:14:59.5147608Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.5147672Z Autotune Choices Stats: 2025-12-04T09:14:59.5149067Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_422", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.5149353Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5149570Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5149882Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5151009Z triton_flex_attention_422 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5152146Z triton_flex_attention_423 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5153293Z triton_flex_attention_420 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.5154432Z triton_flex_attention_421 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.5155553Z triton_flex_attention_418 0.0123 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5156665Z triton_flex_attention_419 0.0142 ms 57.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5156911Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3073 seconds precompiling for 6 choices 2025-12-04T09:14:59.5156976Z Autotune Choices Stats: 2025-12-04T09:14:59.5158413Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_426", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012256000190973282, "best_triton_pos": 0} 2025-12-04T09:14:59.5158879Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5159209Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5159809Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5161009Z triton_flex_attention_backward_426 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5162355Z triton_flex_attention_backward_424 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5163589Z triton_flex_attention_backward_425 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5164748Z triton_flex_attention_backward_427 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5165894Z triton_flex_attention_backward_429 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5167044Z triton_flex_attention_backward_431 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5168258Z triton_flex_attention_backward_430 0.0141 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5169472Z triton_flex_attention_backward_428 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5170668Z triton_flex_attention_backward_433 0.0143 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5171858Z triton_flex_attention_backward_436 0.0144 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.5172112Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.3392 seconds precompiling for 13 choices 2025-12-04T09:14:59.5172245Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.5172355Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.5172429Z unimplemented [] 2025-12-04T09:14:59.5172532Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.5172731Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.5173915Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.5173975Z graph_break [] 2025-12-04T09:14:59.5174114Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.5174180Z Autotune Choices Stats: 2025-12-04T09:14:59.5175567Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_441", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.5175852Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5176075Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5176391Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5177550Z triton_flex_attention_441 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5178690Z triton_flex_attention_442 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5179828Z triton_flex_attention_439 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.5180971Z triton_flex_attention_437 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5182093Z triton_flex_attention_440 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.5183211Z triton_flex_attention_438 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5183489Z SingleProcess AUTOTUNE benchmarking takes 0.2906 seconds and 1.4377 seconds precompiling for 6 choices 2025-12-04T09:14:59.5183558Z Autotune Choices Stats: 2025-12-04T09:14:59.5184996Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_445", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.5185427Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5185794Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5186349Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5187548Z triton_flex_attention_backward_445 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5188733Z triton_flex_attention_backward_446 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5189901Z triton_flex_attention_backward_443 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5191057Z triton_flex_attention_backward_444 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5192202Z triton_flex_attention_backward_447 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5193387Z triton_flex_attention_backward_448 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5194558Z triton_flex_attention_backward_450 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5195739Z triton_flex_attention_backward_449 0.0135 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5196905Z triton_flex_attention_backward_452 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5198118Z triton_flex_attention_backward_455 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.5198370Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.3278 seconds precompiling for 13 choices 2025-12-04T09:14:59.5198497Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.5198567Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.5198637Z unimplemented [] 2025-12-04T09:14:59.5198735Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.5198916Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.5200098Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.5200194Z graph_break [] 2025-12-04T09:14:59.5200328Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.5200394Z Autotune Choices Stats: 2025-12-04T09:14:59.5201784Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_460", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:59.5202024Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5202246Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5202590Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5203739Z triton_flex_attention_460 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5204859Z triton_flex_attention_461 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5206013Z triton_flex_attention_459 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.5207122Z triton_flex_attention_456 0.0114 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5208243Z triton_flex_attention_458 0.0123 ms 75.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.5209364Z triton_flex_attention_457 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5209641Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.4513 seconds precompiling for 6 choices 2025-12-04T09:14:59.5209706Z Autotune Choices Stats: 2025-12-04T09:14:59.5211164Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_463", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.5211596Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5211959Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5212557Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5213729Z triton_flex_attention_backward_463 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5214924Z triton_flex_attention_backward_465 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5216086Z triton_flex_attention_backward_462 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5217241Z triton_flex_attention_backward_464 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5218421Z triton_flex_attention_backward_467 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5219605Z triton_flex_attention_backward_468 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5220794Z triton_flex_attention_backward_469 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5221954Z triton_flex_attention_backward_466 0.0133 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5223132Z triton_flex_attention_backward_471 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5224280Z triton_flex_attention_backward_470 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5224528Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.2662 seconds precompiling for 13 choices 2025-12-04T09:14:59.5224653Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.5224719Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.5224786Z unimplemented [] 2025-12-04T09:14:59.5224889Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.5225069Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.5226252Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.5226345Z graph_break [] 2025-12-04T09:14:59.5226481Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.5226547Z Autotune Choices Stats: 2025-12-04T09:14:59.5227985Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_479", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.007968000136315823, "best_triton_pos": 0} 2025-12-04T09:14:59.5228228Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5228449Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5228762Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5229930Z triton_flex_attention_479 0.0080 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5231078Z triton_flex_attention_480 0.0082 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5232198Z triton_flex_attention_477 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.5233312Z triton_flex_attention_478 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.5234430Z triton_flex_attention_475 0.0113 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5235577Z triton_flex_attention_476 0.0133 ms 59.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5235820Z SingleProcess AUTOTUNE benchmarking takes 0.2917 seconds and 1.3266 seconds precompiling for 6 choices 2025-12-04T09:14:59.5235886Z Autotune Choices Stats: 2025-12-04T09:14:59.5237342Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_481", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.5237805Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5238132Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5238718Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5239899Z triton_flex_attention_backward_481 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5241053Z triton_flex_attention_backward_482 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5242211Z triton_flex_attention_backward_483 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5243388Z triton_flex_attention_backward_484 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5244543Z triton_flex_attention_backward_485 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5245726Z triton_flex_attention_backward_486 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5246900Z triton_flex_attention_backward_487 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5248130Z triton_flex_attention_backward_488 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5249287Z triton_flex_attention_backward_490 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5250442Z triton_flex_attention_backward_493 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.5250696Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2475 seconds precompiling for 13 choices 2025-12-04T09:14:59.5250824Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.5250925Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.5250992Z unimplemented [] 2025-12-04T09:14:59.5251094Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.5251276Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.5252497Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.5252563Z graph_break [] 2025-12-04T09:14:59.5252695Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.5252760Z Autotune Choices Stats: 2025-12-04T09:14:59.5254195Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_498", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.5254470Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5254692Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5255003Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5256161Z triton_flex_attention_498 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5257276Z triton_flex_attention_499 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5258395Z triton_flex_attention_494 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5259515Z triton_flex_attention_496 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.5260661Z triton_flex_attention_497 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.5261928Z triton_flex_attention_495 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5262281Z SingleProcess AUTOTUNE benchmarking takes 0.2916 seconds and 1.3152 seconds precompiling for 6 choices 2025-12-04T09:14:59.5262351Z Autotune Choices Stats: 2025-12-04T09:14:59.5263856Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_500", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:59.5264301Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5264679Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5265242Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5266423Z triton_flex_attention_backward_500 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5267584Z triton_flex_attention_backward_501 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5268744Z triton_flex_attention_backward_502 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5269952Z triton_flex_attention_backward_503 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5271137Z triton_flex_attention_backward_505 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5272319Z triton_flex_attention_backward_507 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5273469Z triton_flex_attention_backward_506 0.0143 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5274652Z triton_flex_attention_backward_504 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5275804Z triton_flex_attention_backward_509 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5276952Z triton_flex_attention_backward_512 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.5277238Z SingleProcess AUTOTUNE benchmarking takes 0.6641 seconds and 2.2733 seconds precompiling for 13 choices 2025-12-04T09:14:59.5277379Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.5277448Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.5277515Z unimplemented [] 2025-12-04T09:14:59.5277618Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.5277803Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.5278990Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 27), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 8), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 5), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.5279052Z graph_break [] 2025-12-04T09:14:59.5279220Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.5279287Z Autotune Choices Stats: 2025-12-04T09:14:59.5280709Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_517", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.5280954Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5281202Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5281517Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5282641Z triton_flex_attention_517 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5283755Z triton_flex_attention_518 0.0084 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5284867Z triton_flex_attention_513 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5286089Z triton_flex_attention_515 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.5287204Z triton_flex_attention_516 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.5288351Z triton_flex_attention_514 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5288601Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3311 seconds precompiling for 6 choices 2025-12-04T09:14:59.5288667Z Autotune Choices Stats: 2025-12-04T09:14:59.5290144Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_519", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:59.5290604Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5290933Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5291488Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5292723Z triton_flex_attention_backward_519 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5293884Z triton_flex_attention_backward_520 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5295098Z triton_flex_attention_backward_521 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5296281Z triton_flex_attention_backward_522 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5297471Z triton_flex_attention_backward_524 0.0132 ms 93.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5298624Z triton_flex_attention_backward_523 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5299802Z triton_flex_attention_backward_525 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5300956Z triton_flex_attention_backward_526 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5302105Z triton_flex_attention_backward_528 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5303293Z triton_flex_attention_backward_531 0.0153 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.5303538Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.2909 seconds precompiling for 13 choices 2025-12-04T09:14:59.5303673Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.5303743Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.5303809Z unimplemented [] 2025-12-04T09:14:59.5303912Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.5304094Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.5305319Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.5305383Z graph_break [] 2025-12-04T09:14:59.5305522Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.5305620Z Autotune Choices Stats: 2025-12-04T09:14:59.5307015Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_536", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.5307290Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5307504Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5307829Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5308963Z triton_flex_attention_536 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5310089Z triton_flex_attention_537 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5311203Z triton_flex_attention_532 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5312357Z triton_flex_attention_534 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.5313519Z triton_flex_attention_535 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.5314665Z triton_flex_attention_533 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5314916Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.2928 seconds precompiling for 6 choices 2025-12-04T09:14:59.5315016Z Autotune Choices Stats: 2025-12-04T09:14:59.5316467Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_538", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:14:59.5316895Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5317227Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5317784Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5318956Z triton_flex_attention_backward_538 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5320153Z triton_flex_attention_backward_539 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5321308Z triton_flex_attention_backward_540 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5322491Z triton_flex_attention_backward_541 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5323666Z triton_flex_attention_backward_542 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5324873Z triton_flex_attention_backward_543 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5326019Z triton_flex_attention_backward_545 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5327172Z triton_flex_attention_backward_544 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5328323Z triton_flex_attention_backward_547 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5329518Z triton_flex_attention_backward_550 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.5329760Z SingleProcess AUTOTUNE benchmarking takes 0.6681 seconds and 2.3456 seconds precompiling for 13 choices 2025-12-04T09:14:59.5329896Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.5329966Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.5330032Z unimplemented [] 2025-12-04T09:14:59.5330165Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.5330350Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.5331553Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.5331613Z graph_break [] 2025-12-04T09:14:59.5331746Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.5331850Z Autotune Choices Stats: 2025-12-04T09:14:59.5333285Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_555", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:59.5333533Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5333747Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5334073Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5335194Z triton_flex_attention_555 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5336313Z triton_flex_attention_556 0.0112 ms 91.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5337468Z triton_flex_attention_553 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.5338611Z triton_flex_attention_554 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.5339763Z triton_flex_attention_551 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5340875Z triton_flex_attention_552 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5341155Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.3230 seconds precompiling for 6 choices 2025-12-04T09:14:59.5341221Z Autotune Choices Stats: 2025-12-04T09:14:59.5342649Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_559", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:59.5343081Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5343407Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5343967Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5345133Z triton_flex_attention_backward_559 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5346318Z triton_flex_attention_backward_560 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5347517Z triton_flex_attention_backward_558 0.0142 ms 93.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5348695Z triton_flex_attention_backward_557 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5349887Z triton_flex_attention_backward_562 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5351038Z triton_flex_attention_backward_561 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5352185Z triton_flex_attention_backward_563 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5353337Z triton_flex_attention_backward_564 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5354515Z triton_flex_attention_backward_569 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.5355699Z triton_flex_attention_backward_566 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5355947Z SingleProcess AUTOTUNE benchmarking takes 0.6659 seconds and 2.2862 seconds precompiling for 13 choices 2025-12-04T09:14:59.5356082Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.5356150Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.5356218Z unimplemented [] 2025-12-04T09:14:59.5356320Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.5356552Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.5357741Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.5357834Z graph_break [] 2025-12-04T09:14:59.5357965Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.5358032Z Autotune Choices Stats: 2025-12-04T09:14:59.5359418Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_575", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:59.5359666Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5359881Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5360199Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5361317Z triton_flex_attention_575 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5362645Z triton_flex_attention_574 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5363787Z triton_flex_attention_572 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.5364975Z triton_flex_attention_573 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.5366143Z triton_flex_attention_570 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5367302Z triton_flex_attention_571 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5367553Z SingleProcess AUTOTUNE benchmarking takes 0.2930 seconds and 1.2848 seconds precompiling for 6 choices 2025-12-04T09:14:59.5367621Z Autotune Choices Stats: 2025-12-04T09:14:59.5369063Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_579", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.014271999709308147, "best_triton_pos": 0} 2025-12-04T09:14:59.5369514Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5369844Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5370453Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5371627Z triton_flex_attention_backward_579 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5372866Z triton_flex_attention_backward_577 0.0143 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5374066Z triton_flex_attention_backward_578 0.0143 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5375219Z triton_flex_attention_backward_576 0.0145 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5376408Z triton_flex_attention_backward_580 0.0154 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5377559Z triton_flex_attention_backward_581 0.0154 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5378709Z triton_flex_attention_backward_583 0.0155 ms 92.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5379902Z triton_flex_attention_backward_582 0.0164 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5381054Z triton_flex_attention_backward_585 0.0164 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5382248Z triton_flex_attention_backward_588 0.0164 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.5382572Z SingleProcess AUTOTUNE benchmarking takes 0.6683 seconds and 2.2637 seconds precompiling for 13 choices 2025-12-04T09:14:59.5382715Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.5382783Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.5382850Z unimplemented [] 2025-12-04T09:14:59.5382955Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.5383184Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.5384370Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.5384429Z graph_break [] 2025-12-04T09:14:59.5384562Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.5384627Z Autotune Choices Stats: 2025-12-04T09:14:59.5386022Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.5386274Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5386490Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5386809Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5387972Z triton_flex_attention_593 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5389096Z triton_flex_attention_594 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5390259Z triton_flex_attention_589 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5391416Z triton_flex_attention_591 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.5392567Z triton_flex_attention_592 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.5393688Z triton_flex_attention_590 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5393938Z SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.3283 seconds precompiling for 6 choices 2025-12-04T09:14:59.5394004Z Autotune Choices Stats: 2025-12-04T09:14:59.5395431Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_595", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:14:59.5395860Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5396224Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5396776Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5397950Z triton_flex_attention_backward_595 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5399134Z triton_flex_attention_backward_596 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5400321Z triton_flex_attention_backward_597 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5401502Z triton_flex_attention_backward_598 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5402649Z triton_flex_attention_backward_600 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5403806Z triton_flex_attention_backward_604 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5404956Z triton_flex_attention_backward_599 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5406140Z triton_flex_attention_backward_601 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5407321Z triton_flex_attention_backward_602 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5408507Z triton_flex_attention_backward_607 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.5408750Z SingleProcess AUTOTUNE benchmarking takes 0.6683 seconds and 2.3815 seconds precompiling for 13 choices 2025-12-04T09:14:59.5408917Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.5408983Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.5409047Z unimplemented [] 2025-12-04T09:14:59.5409156Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.5409338Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.5410518Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.5410578Z graph_break [] 2025-12-04T09:14:59.5410718Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.5410785Z Autotune Choices Stats: 2025-12-04T09:14:59.5412186Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_612", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.5412471Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5412728Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5413046Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5414172Z triton_flex_attention_612 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5415322Z triton_flex_attention_613 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5416471Z triton_flex_attention_610 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.5417593Z triton_flex_attention_608 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5418741Z triton_flex_attention_611 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.5419849Z triton_flex_attention_609 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5420100Z SingleProcess AUTOTUNE benchmarking takes 0.2948 seconds and 1.3336 seconds precompiling for 6 choices 2025-12-04T09:14:59.5420167Z Autotune Choices Stats: 2025-12-04T09:14:59.5421599Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_614", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:14:59.5422080Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5422410Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5422967Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5424176Z triton_flex_attention_backward_614 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5425360Z triton_flex_attention_backward_615 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5426548Z triton_flex_attention_backward_616 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5427698Z triton_flex_attention_backward_617 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5428850Z triton_flex_attention_backward_619 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5429998Z triton_flex_attention_backward_620 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5431182Z triton_flex_attention_backward_621 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5432363Z triton_flex_attention_backward_618 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5433555Z triton_flex_attention_backward_623 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5434715Z triton_flex_attention_backward_626 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.5434989Z SingleProcess AUTOTUNE benchmarking takes 0.6671 seconds and 2.3094 seconds precompiling for 13 choices 2025-12-04T09:14:59.5435126Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.5435193Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.5435259Z unimplemented [] 2025-12-04T09:14:59.5435370Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.5435556Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.5436736Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.5436798Z graph_break [] 2025-12-04T09:14:59.5436932Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.5436998Z Autotune Choices Stats: 2025-12-04T09:14:59.5438386Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_631", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:59.5438672Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5438889Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5439212Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5440383Z triton_flex_attention_631 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5441533Z triton_flex_attention_632 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5442640Z triton_flex_attention_627 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5443791Z triton_flex_attention_629 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.5444910Z triton_flex_attention_630 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.5446026Z triton_flex_attention_628 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5446272Z SingleProcess AUTOTUNE benchmarking takes 0.2939 seconds and 1.3035 seconds precompiling for 6 choices 2025-12-04T09:14:59.5446372Z Autotune Choices Stats: 2025-12-04T09:14:59.5447802Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_634", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:59.5448237Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5448558Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5449157Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5450347Z triton_flex_attention_backward_634 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5451506Z triton_flex_attention_backward_635 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5452816Z triton_flex_attention_backward_636 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5453967Z triton_flex_attention_backward_633 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5455132Z triton_flex_attention_backward_638 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5456332Z triton_flex_attention_backward_640 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5457485Z triton_flex_attention_backward_637 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5458672Z triton_flex_attention_backward_639 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5459858Z triton_flex_attention_backward_642 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5461043Z triton_flex_attention_backward_645 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.5461285Z SingleProcess AUTOTUNE benchmarking takes 0.6697 seconds and 2.2461 seconds precompiling for 13 choices 2025-12-04T09:14:59.5461419Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.5461489Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.5461556Z unimplemented [] 2025-12-04T09:14:59.5461663Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.5462005Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.5463192Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.5463252Z graph_break [] 2025-12-04T09:14:59.5463379Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.5463515Z Autotune Choices Stats: 2025-12-04T09:14:59.5464901Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_650", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:59.5465153Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5465382Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5465706Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5466877Z triton_flex_attention_650 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5468040Z triton_flex_attention_651 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5469244Z triton_flex_attention_648 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.5470371Z triton_flex_attention_646 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5471494Z triton_flex_attention_649 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.5472611Z triton_flex_attention_647 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5472896Z SingleProcess AUTOTUNE benchmarking takes 0.2938 seconds and 1.3188 seconds precompiling for 6 choices 2025-12-04T09:14:59.5472960Z Autotune Choices Stats: 2025-12-04T09:14:59.5474387Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_654", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:59.5474860Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5475187Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5475781Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5476944Z triton_flex_attention_backward_654 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5478135Z triton_flex_attention_backward_652 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5479292Z triton_flex_attention_backward_653 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5480437Z triton_flex_attention_backward_655 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5481589Z triton_flex_attention_backward_657 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5482764Z triton_flex_attention_backward_659 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5483947Z triton_flex_attention_backward_658 0.0156 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5485149Z triton_flex_attention_backward_656 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5486326Z triton_flex_attention_backward_661 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5487476Z triton_flex_attention_backward_664 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.5487725Z SingleProcess AUTOTUNE benchmarking takes 0.6665 seconds and 2.3325 seconds precompiling for 13 choices 2025-12-04T09:14:59.5487860Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.5487928Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.5487992Z unimplemented [] 2025-12-04T09:14:59.5488100Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.5488281Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.5489465Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.5489562Z graph_break [] 2025-12-04T09:14:59.5489688Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.5489761Z Autotune Choices Stats: 2025-12-04T09:14:59.5491146Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_669", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:59.5491394Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5491642Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5491963Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5493186Z triton_flex_attention_669 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5494307Z triton_flex_attention_670 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5495453Z triton_flex_attention_667 0.0123 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.5496570Z triton_flex_attention_665 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5497675Z triton_flex_attention_668 0.0135 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.5498825Z triton_flex_attention_666 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5499078Z SingleProcess AUTOTUNE benchmarking takes 0.2949 seconds and 1.4095 seconds precompiling for 6 choices 2025-12-04T09:14:59.5499144Z Autotune Choices Stats: 2025-12-04T09:14:59.5500603Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_673", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012512000277638435, "best_triton_pos": 0} 2025-12-04T09:14:59.5501044Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5501399Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5501962Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5503160Z triton_flex_attention_backward_673 0.0125 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5504330Z triton_flex_attention_backward_672 0.0133 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5505492Z triton_flex_attention_backward_674 0.0133 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5506636Z triton_flex_attention_backward_671 0.0143 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5507819Z triton_flex_attention_backward_676 0.0143 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5508995Z triton_flex_attention_backward_675 0.0152 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5510183Z triton_flex_attention_backward_677 0.0154 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5511333Z triton_flex_attention_backward_678 0.0154 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5512510Z triton_flex_attention_backward_680 0.0164 ms 76.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5513665Z triton_flex_attention_backward_683 0.0164 ms 76.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.5513908Z SingleProcess AUTOTUNE benchmarking takes 0.6665 seconds and 2.3098 seconds precompiling for 13 choices 2025-12-04T09:14:59.5514041Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.5514108Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.5514169Z unimplemented [] 2025-12-04T09:14:59.5514274Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.5514454Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.5515660Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.5515721Z graph_break [] 2025-12-04T09:14:59.5515847Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.5515920Z Autotune Choices Stats: 2025-12-04T09:14:59.5517394Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_688", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:59.5517642Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5517855Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5518204Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5519334Z triton_flex_attention_688 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5520478Z triton_flex_attention_689 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5521594Z triton_flex_attention_686 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.5522711Z triton_flex_attention_684 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5523818Z triton_flex_attention_687 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.5524965Z triton_flex_attention_685 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5525213Z SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.3382 seconds precompiling for 6 choices 2025-12-04T09:14:59.5525279Z Autotune Choices Stats: 2025-12-04T09:14:59.5526765Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_693", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:59.5527199Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5527524Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5528119Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5529292Z triton_flex_attention_backward_693 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5530453Z triton_flex_attention_backward_692 0.0135 ms 98.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5531607Z triton_flex_attention_backward_691 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5532837Z triton_flex_attention_backward_695 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5534028Z triton_flex_attention_backward_690 0.0153 ms 86.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5535208Z triton_flex_attention_backward_696 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5536361Z triton_flex_attention_backward_697 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5537550Z triton_flex_attention_backward_699 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5538696Z triton_flex_attention_backward_694 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5539849Z triton_flex_attention_backward_702 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.5540090Z SingleProcess AUTOTUNE benchmarking takes 0.6670 seconds and 2.3646 seconds precompiling for 13 choices 2025-12-04T09:14:59.5540256Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.5540325Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.5540387Z unimplemented [] 2025-12-04T09:14:59.5540492Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.5540673Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.5541841Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.5541906Z graph_break [] 2025-12-04T09:14:59.5542032Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.5542105Z Autotune Choices Stats: 2025-12-04T09:14:59.5543551Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_707", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:59.5543798Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5544011Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5544367Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5545492Z triton_flex_attention_707 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5546604Z triton_flex_attention_708 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5547714Z triton_flex_attention_703 0.0134 ms 76.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5548835Z triton_flex_attention_705 0.0135 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.5550003Z triton_flex_attention_706 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.5551151Z triton_flex_attention_704 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5551394Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3640 seconds precompiling for 6 choices 2025-12-04T09:14:59.5551466Z Autotune Choices Stats: 2025-12-04T09:14:59.5552920Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_712", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:59.5553401Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5553726Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5560254Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5561489Z triton_flex_attention_backward_712 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5562900Z triton_flex_attention_backward_710 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5564189Z triton_flex_attention_backward_711 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5565366Z triton_flex_attention_backward_709 0.0153 ms 86.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5566601Z triton_flex_attention_backward_714 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5567806Z triton_flex_attention_backward_716 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5569020Z triton_flex_attention_backward_713 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5570195Z triton_flex_attention_backward_715 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5571373Z triton_flex_attention_backward_718 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5572587Z triton_flex_attention_backward_721 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.5572903Z SingleProcess AUTOTUNE benchmarking takes 0.6684 seconds and 2.3297 seconds precompiling for 13 choices 2025-12-04T09:14:59.5573052Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.5573126Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.5573191Z unimplemented [] 2025-12-04T09:14:59.5573304Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.5573494Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.5574709Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.5574782Z graph_break [] 2025-12-04T09:14:59.5574915Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.5574989Z Autotune Choices Stats: 2025-12-04T09:14:59.5576416Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_726", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.5576705Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5576928Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5577250Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5578385Z triton_flex_attention_726 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5579511Z triton_flex_attention_727 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5580624Z triton_flex_attention_724 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.5581769Z triton_flex_attention_722 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5582911Z triton_flex_attention_725 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.5584029Z triton_flex_attention_723 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5584309Z SingleProcess AUTOTUNE benchmarking takes 0.2946 seconds and 1.3261 seconds precompiling for 6 choices 2025-12-04T09:14:59.5584384Z Autotune Choices Stats: 2025-12-04T09:14:59.5585809Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_730", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:59.5586309Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5586640Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5587207Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5588369Z triton_flex_attention_backward_730 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5589525Z triton_flex_attention_backward_729 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5590711Z triton_flex_attention_backward_731 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5591893Z triton_flex_attention_backward_728 0.0152 ms 87.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5593073Z triton_flex_attention_backward_733 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5594218Z triton_flex_attention_backward_732 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5595421Z triton_flex_attention_backward_734 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5596574Z triton_flex_attention_backward_735 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5597728Z triton_flex_attention_backward_737 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5598926Z triton_flex_attention_backward_736 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5599172Z SingleProcess AUTOTUNE benchmarking takes 0.6686 seconds and 2.3200 seconds precompiling for 13 choices 2025-12-04T09:14:59.5599350Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:14:59.5599431Z Traceback (most recent call last): 2025-12-04T09:14:59.5599733Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:14:59.5599804Z self.assertTrue( 2025-12-04T09:14:59.5600034Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:14:59.5600116Z raise self.failureException(msg) 2025-12-04T09:14:59.5600362Z AssertionError: False is not true : Log file /tmp/tmp06g1qqtu/flex_attention_configs.json was not created 2025-12-04T09:14:59.5600367Z 2025-12-04T09:14:59.5600502Z To execute this test, run the following from the base repo dir: 2025-12-04T09:14:59.5600760Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:14:59.5600765Z 2025-12-04T09:14:59.5600962Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:14:59.5601098Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.5601178Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.5601242Z unimplemented [] 2025-12-04T09:14:59.5601379Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.5602565Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('async_compile_cache_miss', 23), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:14:59.5602767Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.5602835Z graph_break [] 2025-12-04T09:14:59.5602962Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.5603961Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:14:59.5604044Z current_size = base.storage().size() 2025-12-04T09:14:59.5604114Z Autotune Choices Stats: 2025-12-04T09:14:59.5605514Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:59.5605809Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5606038Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5606354Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5607486Z triton_flex_attention_4 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5608627Z triton_flex_attention_5 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5609775Z triton_flex_attention_0 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5610915Z triton_flex_attention_2 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.5612029Z triton_flex_attention_3 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.5613190Z triton_flex_attention_1 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5613437Z SingleProcess AUTOTUNE benchmarking takes 0.2600 seconds and 1.4244 seconds precompiling for 6 choices 2025-12-04T09:14:59.5613509Z Autotune Choices Stats: 2025-12-04T09:14:59.5614932Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:59.5615410Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5615738Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5616296Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5617495Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5618704Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5619899Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5621056Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5622208Z triton_flex_attention_backward_11 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5623357Z triton_flex_attention_backward_12 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5624535Z triton_flex_attention_backward_13 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5625706Z triton_flex_attention_backward_10 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5626889Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5628034Z triton_flex_attention_backward_15 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5628319Z SingleProcess AUTOTUNE benchmarking takes 0.8717 seconds and 2.2315 seconds precompiling for 13 choices 2025-12-04T09:14:59.5628452Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.5628528Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.5628593Z unimplemented [] 2025-12-04T09:14:59.5628699Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.5628893Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.5630080Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.5630146Z graph_break [] 2025-12-04T09:14:59.5630272Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.5630341Z Autotune Choices Stats: 2025-12-04T09:14:59.5631729Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.5632009Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5632236Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5632547Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5633705Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5634842Z triton_flex_attention_24 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5635957Z triton_flex_attention_19 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5637097Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.5638208Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.5639322Z triton_flex_attention_20 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5639598Z SingleProcess AUTOTUNE benchmarking takes 0.2898 seconds and 1.2778 seconds precompiling for 6 choices 2025-12-04T09:14:59.5639665Z Autotune Choices Stats: 2025-12-04T09:14:59.5641094Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.5641544Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5641904Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5642463Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5643665Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5644851Z triton_flex_attention_backward_27 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5645994Z triton_flex_attention_backward_28 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5647140Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5648281Z triton_flex_attention_backward_30 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5649472Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5650666Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5651845Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5653034Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5654215Z triton_flex_attention_backward_37 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.5654464Z SingleProcess AUTOTUNE benchmarking takes 0.6605 seconds and 2.3199 seconds precompiling for 13 choices 2025-12-04T09:14:59.5654593Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.5654667Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.5654731Z unimplemented [] 2025-12-04T09:14:59.5654832Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.5655020Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.5656193Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.5656258Z graph_break [] 2025-12-04T09:14:59.5656483Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.5656552Z Autotune Choices Stats: 2025-12-04T09:14:59.5657943Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.5658185Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5658407Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5658756Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5659915Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5661025Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5662360Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.5663502Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.5664627Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5665744Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5666070Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2616 seconds precompiling for 6 choices 2025-12-04T09:14:59.5666139Z Autotune Choices Stats: 2025-12-04T09:14:59.5667581Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.5668063Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5668399Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5669005Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5670184Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5671396Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5672547Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5673706Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5674896Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5676048Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5677227Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5678413Z triton_flex_attention_backward_50 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5679598Z triton_flex_attention_backward_53 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5680748Z triton_flex_attention_backward_56 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.5681003Z SingleProcess AUTOTUNE benchmarking takes 0.6631 seconds and 2.2635 seconds precompiling for 13 choices 2025-12-04T09:14:59.5681134Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.5681211Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.5681281Z unimplemented [] 2025-12-04T09:14:59.5681387Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.5681579Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.5682781Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.5682905Z graph_break [] 2025-12-04T09:14:59.5683033Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.5683101Z Autotune Choices Stats: 2025-12-04T09:14:59.5684500Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.5684794Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5685030Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5685349Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5686515Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5687660Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5688784Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.5689909Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.5691018Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5692169Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5692472Z SingleProcess AUTOTUNE benchmarking takes 0.2907 seconds and 1.2500 seconds precompiling for 6 choices 2025-12-04T09:14:59.5692540Z Autotune Choices Stats: 2025-12-04T09:14:59.5694022Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.5694491Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5694826Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5695387Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5696600Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5697761Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5698922Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5700083Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5701263Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5702449Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5703627Z triton_flex_attention_backward_69 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5704779Z triton_flex_attention_backward_70 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5705959Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5707110Z triton_flex_attention_backward_75 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.5707359Z SingleProcess AUTOTUNE benchmarking takes 0.6601 seconds and 2.1469 seconds precompiling for 13 choices 2025-12-04T09:14:59.5707485Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.5707555Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.5707624Z unimplemented [] 2025-12-04T09:14:59.5707760Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.5707952Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.5709127Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.5709193Z graph_break [] 2025-12-04T09:14:59.5709329Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.5709393Z Autotune Choices Stats: 2025-12-04T09:14:59.5710815Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.5711060Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5711310Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5711621Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5712750Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5713908Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5715033Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.5716147Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.5717294Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5718418Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5718674Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2576 seconds precompiling for 6 choices 2025-12-04T09:14:59.5718773Z Autotune Choices Stats: 2025-12-04T09:14:59.5720244Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_84", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.5720676Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5721040Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5721598Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5722770Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5723926Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5725081Z triton_flex_attention_backward_83 0.0122 ms 92.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5726267Z triton_flex_attention_backward_82 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5727440Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5728621Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5729766Z triton_flex_attention_backward_88 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5730948Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5732095Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5733310Z triton_flex_attention_backward_91 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5733780Z SingleProcess AUTOTUNE benchmarking takes 0.6599 seconds and 2.2799 seconds precompiling for 13 choices 2025-12-04T09:14:59.5733914Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.5733984Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.5734053Z unimplemented [] 2025-12-04T09:14:59.5734156Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.5734344Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.5735521Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.5735587Z graph_break [] 2025-12-04T09:14:59.5735726Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.5735826Z Autotune Choices Stats: 2025-12-04T09:14:59.5737249Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:59.5737493Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5737716Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5738066Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5739197Z triton_flex_attention_99 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5740310Z triton_flex_attention_100 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5741427Z triton_flex_attention_95 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5742539Z triton_flex_attention_98 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.5743709Z triton_flex_attention_97 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.5744852Z triton_flex_attention_96 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5745111Z SingleProcess AUTOTUNE benchmarking takes 0.2905 seconds and 1.2994 seconds precompiling for 6 choices 2025-12-04T09:14:59.5745180Z Autotune Choices Stats: 2025-12-04T09:14:59.5746668Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_102", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.5747133Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5747468Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5748028Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5749208Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5750366Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5751562Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5752722Z triton_flex_attention_backward_101 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5753905Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5755096Z triton_flex_attention_backward_107 0.0133 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5756279Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5757433Z triton_flex_attention_backward_108 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5758589Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5759746Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5760033Z SingleProcess AUTOTUNE benchmarking takes 0.6615 seconds and 2.2434 seconds precompiling for 13 choices 2025-12-04T09:14:59.5760159Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.5760229Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.5760297Z unimplemented [] 2025-12-04T09:14:59.5760401Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.5760589Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.5761968Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.5762046Z graph_break [] 2025-12-04T09:14:59.5762189Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.5762257Z Autotune Choices Stats: 2025-12-04T09:14:59.5763730Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.5764032Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5764257Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5764570Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5765714Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5766835Z triton_flex_attention_119 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5767957Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.5769128Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.5770287Z triton_flex_attention_114 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5771442Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5771695Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2587 seconds precompiling for 6 choices 2025-12-04T09:14:59.5771764Z Autotune Choices Stats: 2025-12-04T09:14:59.5773252Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:59.5773724Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5774058Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5774619Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5775791Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5776946Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5778137Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5779347Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5780533Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5781724Z triton_flex_attention_backward_124 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5782869Z triton_flex_attention_backward_126 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5784020Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5785171Z triton_flex_attention_backward_129 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5786354Z triton_flex_attention_backward_132 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.5786603Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.3330 seconds precompiling for 13 choices 2025-12-04T09:14:59.5786732Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.5786803Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.5786873Z unimplemented [] 2025-12-04T09:14:59.5786978Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.5787192Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.5788409Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.5788470Z graph_break [] 2025-12-04T09:14:59.5788603Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.5788669Z Autotune Choices Stats: 2025-12-04T09:14:59.5790062Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:59.5790386Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5790620Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5790933Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5792060Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5793179Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5794338Z triton_flex_attention_136 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.5795450Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.5796600Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5797746Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5798047Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3488 seconds precompiling for 6 choices 2025-12-04T09:14:59.5798115Z Autotune Choices Stats: 2025-12-04T09:14:59.5799554Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_140", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.5799986Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5800317Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5800872Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5802040Z triton_flex_attention_backward_140 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5803226Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5804407Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5805599Z triton_flex_attention_backward_142 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5806746Z triton_flex_attention_backward_144 0.0131 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5807934Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5809087Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5810249Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5811459Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5812660Z triton_flex_attention_backward_149 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5812948Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.2392 seconds precompiling for 13 choices 2025-12-04T09:14:59.5813078Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.5813146Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.5813215Z unimplemented [] 2025-12-04T09:14:59.5813316Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.5813506Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.5814722Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.5814819Z graph_break [] 2025-12-04T09:14:59.5814953Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.5815020Z Autotune Choices Stats: 2025-12-04T09:14:59.5816427Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.5816675Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5816905Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5817216Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5818349Z triton_flex_attention_156 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5819493Z triton_flex_attention_157 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5820608Z triton_flex_attention_152 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5821754Z triton_flex_attention_154 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.5822903Z triton_flex_attention_155 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.5824047Z triton_flex_attention_153 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5824292Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3309 seconds precompiling for 6 choices 2025-12-04T09:14:59.5824360Z Autotune Choices Stats: 2025-12-04T09:14:59.5825792Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_160", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.5826224Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5826552Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5827108Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5828309Z triton_flex_attention_backward_160 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5829495Z triton_flex_attention_backward_158 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5830684Z triton_flex_attention_backward_159 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5831838Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5833014Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5834168Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5835317Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5836466Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5837649Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5838828Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.5839086Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.1546 seconds precompiling for 13 choices 2025-12-04T09:14:59.5839217Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.5839316Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.5839391Z unimplemented [] 2025-12-04T09:14:59.5839490Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.5839671Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.5840851Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.5840962Z graph_break [] 2025-12-04T09:14:59.5841092Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.5841158Z Autotune Choices Stats: 2025-12-04T09:14:59.5842550Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.5842791Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5843015Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5843330Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5844458Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5845610Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5846758Z triton_flex_attention_171 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5847908Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.5849028Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.5850193Z triton_flex_attention_172 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5850436Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3199 seconds precompiling for 6 choices 2025-12-04T09:14:59.5850507Z Autotune Choices Stats: 2025-12-04T09:14:59.5851936Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_178", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.5852406Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5852788Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5853348Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5854522Z triton_flex_attention_backward_178 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5855705Z triton_flex_attention_backward_177 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5856885Z triton_flex_attention_backward_180 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5858070Z triton_flex_attention_backward_179 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5859220Z triton_flex_attention_backward_182 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5860384Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5861531Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5862882Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5864117Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5865321Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.5865573Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3449 seconds precompiling for 13 choices 2025-12-04T09:14:59.5865704Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.5865818Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.5865888Z unimplemented [] 2025-12-04T09:14:59.5865988Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.5866171Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.5867359Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.5867420Z graph_break [] 2025-12-04T09:14:59.5867554Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.5867625Z Autotune Choices Stats: 2025-12-04T09:14:59.5869024Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:59.5869266Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5869492Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5869868Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5871000Z triton_flex_attention_194 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5872155Z triton_flex_attention_195 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5873341Z triton_flex_attention_192 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.5874457Z triton_flex_attention_193 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.5875608Z triton_flex_attention_190 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5876722Z triton_flex_attention_191 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5876974Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3094 seconds precompiling for 6 choices 2025-12-04T09:14:59.5877042Z Autotune Choices Stats: 2025-12-04T09:14:59.5878471Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.5878942Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5879271Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5879842Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5881049Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5882236Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5883395Z triton_flex_attention_backward_197 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5884576Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5885729Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5886880Z triton_flex_attention_backward_200 0.0124 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5888063Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5889225Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5890414Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.5891598Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5891882Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.3930 seconds precompiling for 13 choices 2025-12-04T09:14:59.5892013Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.5892082Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.5892150Z unimplemented [] 2025-12-04T09:14:59.5892255Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.5892515Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.5893711Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.5893773Z graph_break [] 2025-12-04T09:14:59.5893907Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.5893974Z Autotune Choices Stats: 2025-12-04T09:14:59.5895373Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008063999935984612, "best_triton_pos": 0} 2025-12-04T09:14:59.5895657Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5895880Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5896197Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5897328Z triton_flex_attention_213 0.0081 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5898479Z triton_flex_attention_214 0.0082 ms 98.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5899633Z triton_flex_attention_211 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.5900783Z triton_flex_attention_212 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.5901901Z triton_flex_attention_209 0.0112 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5903019Z triton_flex_attention_210 0.0133 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5903267Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.3148 seconds precompiling for 6 choices 2025-12-04T09:14:59.5903333Z Autotune Choices Stats: 2025-12-04T09:14:59.5904760Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.5905231Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5905558Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5906169Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5907371Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5908529Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5909735Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5910895Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5912051Z triton_flex_attention_backward_219 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5913236Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5914388Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5915573Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5916754Z triton_flex_attention_backward_224 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5917940Z triton_flex_attention_backward_227 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.5918190Z SingleProcess AUTOTUNE benchmarking takes 0.6619 seconds and 2.3746 seconds precompiling for 13 choices 2025-12-04T09:14:59.5918320Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.5918389Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.5918456Z unimplemented [] 2025-12-04T09:14:59.5918560Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.5918748Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.5919936Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.5919996Z graph_break [] 2025-12-04T09:14:59.5920130Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.5920196Z Autotune Choices Stats: 2025-12-04T09:14:59.5921584Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.5921860Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5922085Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5922402Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5923558Z triton_flex_attention_232 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5924757Z triton_flex_attention_233 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5925876Z triton_flex_attention_228 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5927021Z triton_flex_attention_230 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.5928148Z triton_flex_attention_231 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.5929272Z triton_flex_attention_229 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5929573Z SingleProcess AUTOTUNE benchmarking takes 0.2900 seconds and 1.2906 seconds precompiling for 6 choices 2025-12-04T09:14:59.5929640Z Autotune Choices Stats: 2025-12-04T09:14:59.5931071Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:59.5931500Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5931868Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5932497Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5933702Z triton_flex_attention_backward_234 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5934890Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5936054Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5937207Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5938368Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5939585Z triton_flex_attention_backward_240 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5940768Z triton_flex_attention_backward_241 0.0134 ms 91.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5941955Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5943103Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5944293Z triton_flex_attention_backward_246 0.0153 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.5944543Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2768 seconds precompiling for 13 choices 2025-12-04T09:14:59.5944678Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.5944747Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.5944815Z unimplemented [] 2025-12-04T09:14:59.5944918Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.5945099Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.5946282Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.5946378Z graph_break [] 2025-12-04T09:14:59.5946513Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.5946577Z Autotune Choices Stats: 2025-12-04T09:14:59.5947965Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.5948203Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5948459Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5948783Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5949940Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5951063Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5952218Z triton_flex_attention_250 0.0111 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.5953325Z triton_flex_attention_247 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5954446Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.5955556Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5955835Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2952 seconds precompiling for 6 choices 2025-12-04T09:14:59.5955900Z Autotune Choices Stats: 2025-12-04T09:14:59.5957353Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_255", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.5957787Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5958146Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5958704Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5959907Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5961054Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5962376Z triton_flex_attention_backward_254 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5963530Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5964747Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5965945Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5967140Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5968306Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5969521Z triton_flex_attention_backward_262 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5970674Z triton_flex_attention_backward_265 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.5970924Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.3139 seconds precompiling for 13 choices 2025-12-04T09:14:59.5971053Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.5971123Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.5971190Z unimplemented [] 2025-12-04T09:14:59.5971293Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.5971475Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.5972716Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.5972821Z graph_break [] 2025-12-04T09:14:59.5972962Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.5973030Z Autotune Choices Stats: 2025-12-04T09:14:59.5974470Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.5974719Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5974935Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5975289Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5976425Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5977581Z triton_flex_attention_271 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5978699Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5979817Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.5980935Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.5982083Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5982334Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T09:14:59.5982403Z Autotune Choices Stats: 2025-12-04T09:14:59.5983875Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:59.5984346Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.5984678Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.5985271Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.5986445Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5987602Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5988772Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5989956Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5991110Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5992297Z triton_flex_attention_backward_278 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.5993483Z triton_flex_attention_backward_279 0.0136 ms 90.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.5994673Z triton_flex_attention_backward_276 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.5995821Z triton_flex_attention_backward_281 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.5996972Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.5997216Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.2940 seconds precompiling for 13 choices 2025-12-04T09:14:59.5997354Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.5997458Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.5997529Z unimplemented [] 2025-12-04T09:14:59.5997634Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.5997818Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.5999000Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.5999062Z graph_break [] 2025-12-04T09:14:59.5999193Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.5999261Z Autotune Choices Stats: 2025-12-04T09:14:59.6000676Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.6000969Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6001188Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6001512Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6002667Z triton_flex_attention_289 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6003790Z triton_flex_attention_290 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6004920Z triton_flex_attention_287 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.6006029Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6007198Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.6008339Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6008590Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3003 seconds precompiling for 6 choices 2025-12-04T09:14:59.6008654Z Autotune Choices Stats: 2025-12-04T09:14:59.6010109Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_294", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.012160000391304493, "best_triton_pos": 0} 2025-12-04T09:14:59.6010578Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6010902Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6011463Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6012687Z triton_flex_attention_backward_294 0.0122 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6013848Z triton_flex_attention_backward_291 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6014999Z triton_flex_attention_backward_292 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6016184Z triton_flex_attention_backward_293 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6017371Z triton_flex_attention_backward_296 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6018556Z triton_flex_attention_backward_295 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6019707Z triton_flex_attention_backward_297 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6020898Z triton_flex_attention_backward_298 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6022050Z triton_flex_attention_backward_300 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6023204Z triton_flex_attention_backward_303 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.6023481Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.2523 seconds precompiling for 13 choices 2025-12-04T09:14:59.6023614Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.6023682Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.6023747Z unimplemented [] 2025-12-04T09:14:59.6023848Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.6024029Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.6025252Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.6025316Z graph_break [] 2025-12-04T09:14:59.6025448Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.6025512Z Autotune Choices Stats: 2025-12-04T09:14:59.6026927Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_308", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.6027206Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6027423Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6027745Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6028864Z triton_flex_attention_308 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6029984Z triton_flex_attention_309 0.0082 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6031099Z triton_flex_attention_307 0.0112 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.6032248Z triton_flex_attention_304 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6033374Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.6034542Z triton_flex_attention_305 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6034793Z SingleProcess AUTOTUNE benchmarking takes 0.2924 seconds and 1.4096 seconds precompiling for 6 choices 2025-12-04T09:14:59.6034889Z Autotune Choices Stats: 2025-12-04T09:14:59.6036318Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.6036781Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6037108Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6037668Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6038839Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6039993Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6041184Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6042371Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6043561Z triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6044714Z triton_flex_attention_backward_316 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6045896Z triton_flex_attention_backward_317 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6047049Z triton_flex_attention_backward_314 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6048198Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6049384Z triton_flex_attention_backward_318 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6049628Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.3225 seconds precompiling for 13 choices 2025-12-04T09:14:59.6049763Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.6049831Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.6049899Z unimplemented [] 2025-12-04T09:14:59.6050000Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.6050182Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.6051392Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.6051452Z graph_break [] 2025-12-04T09:14:59.6051584Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.6051680Z Autotune Choices Stats: 2025-12-04T09:14:59.6053130Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.007968000136315823, "best_triton_pos": 0} 2025-12-04T09:14:59.6053420Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6053639Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6053959Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6055088Z triton_flex_attention_327 0.0080 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6056211Z triton_flex_attention_328 0.0082 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6057327Z triton_flex_attention_325 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.6058485Z triton_flex_attention_326 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.6059680Z triton_flex_attention_323 0.0113 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6060827Z triton_flex_attention_324 0.0133 ms 59.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6061077Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T09:14:59.6061177Z Autotune Choices Stats: 2025-12-04T09:14:59.6062763Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.6063211Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6063553Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6064118Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6065300Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6066546Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6067705Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6068903Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6070129Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6071340Z triton_flex_attention_backward_336 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6072492Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6073653Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6074797Z triton_flex_attention_backward_338 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6075983Z triton_flex_attention_backward_341 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.6076227Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3668 seconds precompiling for 13 choices 2025-12-04T09:14:59.6076363Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.6076467Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.6076530Z unimplemented [] 2025-12-04T09:14:59.6076638Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.6076818Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.6078027Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.6078090Z graph_break [] 2025-12-04T09:14:59.6078220Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.6078329Z Autotune Choices Stats: 2025-12-04T09:14:59.6079731Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.6079978Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6080193Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6080517Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6081643Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6082754Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6083910Z triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6085059Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.6086206Z triton_flex_attention_345 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.6087318Z triton_flex_attention_343 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6087599Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2820 seconds precompiling for 6 choices 2025-12-04T09:14:59.6087667Z Autotune Choices Stats: 2025-12-04T09:14:59.6089106Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_349", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.6089542Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6089868Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6090427Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6091631Z triton_flex_attention_backward_349 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6092880Z triton_flex_attention_backward_348 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6094071Z triton_flex_attention_backward_350 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6095250Z triton_flex_attention_backward_351 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6096456Z triton_flex_attention_backward_353 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6097605Z triton_flex_attention_backward_355 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6098760Z triton_flex_attention_backward_352 0.0143 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6099923Z triton_flex_attention_backward_354 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6101103Z triton_flex_attention_backward_357 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6102283Z triton_flex_attention_backward_360 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.6102531Z SingleProcess AUTOTUNE benchmarking takes 0.6637 seconds and 2.2959 seconds precompiling for 13 choices 2025-12-04T09:14:59.6102675Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.6102745Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.6102810Z unimplemented [] 2025-12-04T09:14:59.6102919Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.6103132Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.6104319Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.6104414Z graph_break [] 2025-12-04T09:14:59.6104544Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.6104610Z Autotune Choices Stats: 2025-12-04T09:14:59.6105994Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.6106245Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6106459Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6106778Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6107901Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6109057Z triton_flex_attention_366 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6110169Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.6111328Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6112484Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.6113630Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6113879Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3270 seconds precompiling for 6 choices 2025-12-04T09:14:59.6113945Z Autotune Choices Stats: 2025-12-04T09:14:59.6115373Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_368", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.6115808Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6116131Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6116731Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6117894Z triton_flex_attention_backward_368 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6119085Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6120273Z triton_flex_attention_backward_369 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6121429Z triton_flex_attention_backward_370 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6122625Z triton_flex_attention_backward_372 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6123779Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6124943Z triton_flex_attention_backward_374 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6126135Z triton_flex_attention_backward_371 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6127279Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6128463Z triton_flex_attention_backward_379 0.0146 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.6128754Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.4637 seconds precompiling for 13 choices 2025-12-04T09:14:59.6128887Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.6128954Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.6129027Z unimplemented [] 2025-12-04T09:14:59.6129131Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.6129347Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.6130526Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.6130587Z graph_break [] 2025-12-04T09:14:59.6130715Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.6130798Z Autotune Choices Stats: 2025-12-04T09:14:59.6132189Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_384", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.6132481Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6132699Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6133019Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6134181Z triton_flex_attention_384 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6135307Z triton_flex_attention_385 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6136455Z triton_flex_attention_380 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6137605Z triton_flex_attention_382 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.6138757Z triton_flex_attention_383 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.6139871Z triton_flex_attention_381 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6140121Z SingleProcess AUTOTUNE benchmarking takes 0.2918 seconds and 1.3242 seconds precompiling for 6 choices 2025-12-04T09:14:59.6140187Z Autotune Choices Stats: 2025-12-04T09:14:59.6141607Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_386", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:59.6142078Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6142400Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6142960Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6144154Z triton_flex_attention_backward_386 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6145344Z triton_flex_attention_backward_387 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6146503Z triton_flex_attention_backward_388 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6147693Z triton_flex_attention_backward_389 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6148850Z triton_flex_attention_backward_391 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6150010Z triton_flex_attention_backward_392 0.0134 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6151165Z triton_flex_attention_backward_390 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6152350Z triton_flex_attention_backward_393 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6153526Z triton_flex_attention_backward_395 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6154712Z triton_flex_attention_backward_394 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6155001Z SingleProcess AUTOTUNE benchmarking takes 0.6645 seconds and 2.2524 seconds precompiling for 13 choices 2025-12-04T09:14:59.6155140Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.6155208Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.6155269Z unimplemented [] 2025-12-04T09:14:59.6155380Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.6155560Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.6156737Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.6156798Z graph_break [] 2025-12-04T09:14:59.6156924Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.6156994Z Autotune Choices Stats: 2025-12-04T09:14:59.6158381Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_403", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.6158630Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6158885Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6159204Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6160332Z triton_flex_attention_403 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6161499Z triton_flex_attention_404 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6162827Z triton_flex_attention_399 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6163960Z triton_flex_attention_401 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.6165121Z triton_flex_attention_402 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.6166246Z triton_flex_attention_400 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6166498Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3374 seconds precompiling for 6 choices 2025-12-04T09:14:59.6166562Z Autotune Choices Stats: 2025-12-04T09:14:59.6167988Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_406", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.6168480Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6168816Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6169387Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6170597Z triton_flex_attention_backward_406 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6171786Z triton_flex_attention_backward_405 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6173026Z triton_flex_attention_backward_407 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6174175Z triton_flex_attention_backward_408 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6175335Z triton_flex_attention_backward_410 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6176485Z triton_flex_attention_backward_409 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6177917Z triton_flex_attention_backward_411 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6179102Z triton_flex_attention_backward_412 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6180283Z triton_flex_attention_backward_413 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6181444Z triton_flex_attention_backward_414 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6181722Z SingleProcess AUTOTUNE benchmarking takes 0.6644 seconds and 2.4031 seconds precompiling for 13 choices 2025-12-04T09:14:59.6181860Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.6181928Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.6181992Z unimplemented [] 2025-12-04T09:14:59.6182103Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.6182284Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.6183467Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.6183531Z graph_break [] 2025-12-04T09:14:59.6183656Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.6183728Z Autotune Choices Stats: 2025-12-04T09:14:59.6185111Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_422", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.6185399Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6185615Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6185943Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6187100Z triton_flex_attention_422 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6188245Z triton_flex_attention_423 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6189368Z triton_flex_attention_420 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.6190534Z triton_flex_attention_421 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.6191654Z triton_flex_attention_418 0.0123 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6192777Z triton_flex_attention_419 0.0142 ms 57.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6193032Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3073 seconds precompiling for 6 choices 2025-12-04T09:14:59.6193194Z Autotune Choices Stats: 2025-12-04T09:14:59.6194627Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_426", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012256000190973282, "best_triton_pos": 0} 2025-12-04T09:14:59.6195069Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6195423Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6195992Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6197186Z triton_flex_attention_backward_426 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6198336Z triton_flex_attention_backward_424 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6199530Z triton_flex_attention_backward_425 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6200681Z triton_flex_attention_backward_427 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6201835Z triton_flex_attention_backward_429 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6203015Z triton_flex_attention_backward_431 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6204196Z triton_flex_attention_backward_430 0.0141 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6205388Z triton_flex_attention_backward_428 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6206535Z triton_flex_attention_backward_433 0.0143 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6207716Z triton_flex_attention_backward_436 0.0144 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.6207959Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.3392 seconds precompiling for 13 choices 2025-12-04T09:14:59.6208092Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.6208163Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.6208225Z unimplemented [] 2025-12-04T09:14:59.6208334Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.6208515Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.6209695Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.6209754Z graph_break [] 2025-12-04T09:14:59.6209880Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.6209986Z Autotune Choices Stats: 2025-12-04T09:14:59.6211376Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_441", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.6211626Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6211840Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6212163Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6213382Z triton_flex_attention_441 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6214525Z triton_flex_attention_442 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6215670Z triton_flex_attention_439 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.6216795Z triton_flex_attention_437 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6217909Z triton_flex_attention_440 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.6219027Z triton_flex_attention_438 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6219305Z SingleProcess AUTOTUNE benchmarking takes 0.2906 seconds and 1.4377 seconds precompiling for 6 choices 2025-12-04T09:14:59.6219376Z Autotune Choices Stats: 2025-12-04T09:14:59.6220812Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_445", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.6221279Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6221603Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6222197Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6223361Z triton_flex_attention_backward_445 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6224579Z triton_flex_attention_backward_446 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6225729Z triton_flex_attention_backward_443 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6226878Z triton_flex_attention_backward_444 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6228068Z triton_flex_attention_backward_447 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6229218Z triton_flex_attention_backward_448 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6230401Z triton_flex_attention_backward_450 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6231584Z triton_flex_attention_backward_449 0.0135 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6232764Z triton_flex_attention_backward_452 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6233916Z triton_flex_attention_backward_455 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.6234163Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.3278 seconds precompiling for 13 choices 2025-12-04T09:14:59.6234295Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.6234361Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.6234422Z unimplemented [] 2025-12-04T09:14:59.6234529Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.6234713Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.6235895Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.6236005Z graph_break [] 2025-12-04T09:14:59.6236134Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.6236206Z Autotune Choices Stats: 2025-12-04T09:14:59.6237589Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_460", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:59.6237873Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6238089Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6238411Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6239571Z triton_flex_attention_460 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6240720Z triton_flex_attention_461 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6241829Z triton_flex_attention_459 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.6242945Z triton_flex_attention_456 0.0114 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6244056Z triton_flex_attention_458 0.0123 ms 75.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.6245204Z triton_flex_attention_457 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6245443Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.4513 seconds precompiling for 6 choices 2025-12-04T09:14:59.6245513Z Autotune Choices Stats: 2025-12-04T09:14:59.6246966Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_463", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.6247406Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6247757Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6248328Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6249524Z triton_flex_attention_backward_463 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6250681Z triton_flex_attention_backward_465 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6251837Z triton_flex_attention_backward_462 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6253044Z triton_flex_attention_backward_464 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6254246Z triton_flex_attention_backward_467 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6255424Z triton_flex_attention_backward_468 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6256625Z triton_flex_attention_backward_469 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6257772Z triton_flex_attention_backward_466 0.0133 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6258955Z triton_flex_attention_backward_471 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6260108Z triton_flex_attention_backward_470 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6260352Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.2662 seconds precompiling for 13 choices 2025-12-04T09:14:59.6260484Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.6260552Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.6260613Z unimplemented [] 2025-12-04T09:14:59.6260717Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.6260935Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.6262257Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.6262328Z graph_break [] 2025-12-04T09:14:59.6262459Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.6262537Z Autotune Choices Stats: 2025-12-04T09:14:59.6263991Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_479", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.007968000136315823, "best_triton_pos": 0} 2025-12-04T09:14:59.6264247Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6264506Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6264832Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6265963Z triton_flex_attention_479 0.0080 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6267124Z triton_flex_attention_480 0.0082 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6268239Z triton_flex_attention_477 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.6269360Z triton_flex_attention_478 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.6270517Z triton_flex_attention_475 0.0113 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6271641Z triton_flex_attention_476 0.0133 ms 59.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6271894Z SingleProcess AUTOTUNE benchmarking takes 0.2917 seconds and 1.3266 seconds precompiling for 6 choices 2025-12-04T09:14:59.6271967Z Autotune Choices Stats: 2025-12-04T09:14:59.6273460Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_481", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.6273900Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6274260Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6274824Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6275987Z triton_flex_attention_backward_481 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6277147Z triton_flex_attention_backward_482 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6278302Z triton_flex_attention_backward_483 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6279494Z triton_flex_attention_backward_484 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6280681Z triton_flex_attention_backward_485 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6281868Z triton_flex_attention_backward_486 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6283018Z triton_flex_attention_backward_487 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6284212Z triton_flex_attention_backward_488 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6285366Z triton_flex_attention_backward_490 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6286529Z triton_flex_attention_backward_493 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.6286772Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2475 seconds precompiling for 13 choices 2025-12-04T09:14:59.6286944Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.6287014Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.6287076Z unimplemented [] 2025-12-04T09:14:59.6287184Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.6287365Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.6288537Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.6288604Z graph_break [] 2025-12-04T09:14:59.6288732Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.6288853Z Autotune Choices Stats: 2025-12-04T09:14:59.6290276Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_498", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.6290526Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6290745Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6291096Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6292227Z triton_flex_attention_498 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6293424Z triton_flex_attention_499 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6294542Z triton_flex_attention_494 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6295662Z triton_flex_attention_496 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.6296813Z triton_flex_attention_497 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.6297963Z triton_flex_attention_495 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6298209Z SingleProcess AUTOTUNE benchmarking takes 0.2916 seconds and 1.3152 seconds precompiling for 6 choices 2025-12-04T09:14:59.6298280Z Autotune Choices Stats: 2025-12-04T09:14:59.6299738Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_500", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:59.6300210Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6300539Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6301105Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6302285Z triton_flex_attention_backward_500 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6303446Z triton_flex_attention_backward_501 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6304635Z triton_flex_attention_backward_502 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6305799Z triton_flex_attention_backward_503 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6306985Z triton_flex_attention_backward_505 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6308165Z triton_flex_attention_backward_507 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6309368Z triton_flex_attention_backward_506 0.0143 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6310520Z triton_flex_attention_backward_504 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6311674Z triton_flex_attention_backward_509 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6312834Z triton_flex_attention_backward_512 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.6313114Z SingleProcess AUTOTUNE benchmarking takes 0.6641 seconds and 2.2733 seconds precompiling for 13 choices 2025-12-04T09:14:59.6313250Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.6313322Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.6313384Z unimplemented [] 2025-12-04T09:14:59.6313493Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.6313680Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.6314886Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 27), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 8), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 5), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.6314959Z graph_break [] 2025-12-04T09:14:59.6315089Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.6315162Z Autotune Choices Stats: 2025-12-04T09:14:59.6316585Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_517", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.6316866Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6317085Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6317400Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6318531Z triton_flex_attention_517 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6319647Z triton_flex_attention_518 0.0084 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6320761Z triton_flex_attention_513 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6321942Z triton_flex_attention_515 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.6323090Z triton_flex_attention_516 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.6324243Z triton_flex_attention_514 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6324494Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3311 seconds precompiling for 6 choices 2025-12-04T09:14:59.6324568Z Autotune Choices Stats: 2025-12-04T09:14:59.6325998Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_519", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:59.6326509Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6326832Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6327399Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6328558Z triton_flex_attention_backward_519 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6329716Z triton_flex_attention_backward_520 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6330901Z triton_flex_attention_backward_521 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6332091Z triton_flex_attention_backward_522 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6333346Z triton_flex_attention_backward_524 0.0132 ms 93.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6334505Z triton_flex_attention_backward_523 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6335693Z triton_flex_attention_backward_525 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6336841Z triton_flex_attention_backward_526 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6337998Z triton_flex_attention_backward_528 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6339183Z triton_flex_attention_backward_531 0.0153 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.6339428Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.2909 seconds precompiling for 13 choices 2025-12-04T09:14:59.6339556Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.6339628Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.6339690Z unimplemented [] 2025-12-04T09:14:59.6339799Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.6340013Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.6341220Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.6341288Z graph_break [] 2025-12-04T09:14:59.6341416Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.6341485Z Autotune Choices Stats: 2025-12-04T09:14:59.6342876Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_536", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.6343153Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6343367Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6343681Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6344817Z triton_flex_attention_536 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6345935Z triton_flex_attention_537 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6347081Z triton_flex_attention_532 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6348205Z triton_flex_attention_534 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.6349364Z triton_flex_attention_535 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.6350517Z triton_flex_attention_533 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6350807Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.2928 seconds precompiling for 6 choices 2025-12-04T09:14:59.6350879Z Autotune Choices Stats: 2025-12-04T09:14:59.6352298Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_538", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:14:59.6352740Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6353064Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6353631Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6354800Z triton_flex_attention_backward_538 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6355997Z triton_flex_attention_backward_539 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6357185Z triton_flex_attention_backward_540 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6358377Z triton_flex_attention_backward_541 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6359544Z triton_flex_attention_backward_542 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6360740Z triton_flex_attention_backward_543 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6362081Z triton_flex_attention_backward_545 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6363245Z triton_flex_attention_backward_544 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6364462Z triton_flex_attention_backward_547 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6365627Z triton_flex_attention_backward_550 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.6365921Z SingleProcess AUTOTUNE benchmarking takes 0.6681 seconds and 2.3456 seconds precompiling for 13 choices 2025-12-04T09:14:59.6366050Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.6366127Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.6366188Z unimplemented [] 2025-12-04T09:14:59.6366299Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.6366480Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.6367721Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.6367834Z graph_break [] 2025-12-04T09:14:59.6367964Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.6368038Z Autotune Choices Stats: 2025-12-04T09:14:59.6369436Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_555", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:59.6369683Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6369903Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6370215Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6371353Z triton_flex_attention_555 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6372564Z triton_flex_attention_556 0.0112 ms 91.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6373684Z triton_flex_attention_553 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.6374842Z triton_flex_attention_554 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.6375986Z triton_flex_attention_551 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6377107Z triton_flex_attention_552 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6377382Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.3230 seconds precompiling for 6 choices 2025-12-04T09:14:59.6377456Z Autotune Choices Stats: 2025-12-04T09:14:59.6378888Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_559", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:59.6379334Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6379664Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6380230Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6381428Z triton_flex_attention_backward_559 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6382584Z triton_flex_attention_backward_560 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6383768Z triton_flex_attention_backward_558 0.0142 ms 93.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6384970Z triton_flex_attention_backward_557 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6386154Z triton_flex_attention_backward_562 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6387298Z triton_flex_attention_backward_561 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6388451Z triton_flex_attention_backward_563 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6389610Z triton_flex_attention_backward_564 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6390796Z triton_flex_attention_backward_569 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.6391981Z triton_flex_attention_backward_566 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6392240Z SingleProcess AUTOTUNE benchmarking takes 0.6659 seconds and 2.2862 seconds precompiling for 13 choices 2025-12-04T09:14:59.6392368Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.6392476Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.6392538Z unimplemented [] 2025-12-04T09:14:59.6392641Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.6392828Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.6394010Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.6394111Z graph_break [] 2025-12-04T09:14:59.6394235Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.6394304Z Autotune Choices Stats: 2025-12-04T09:14:59.6395685Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_575", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:59.6395934Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6396153Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6396466Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6397596Z triton_flex_attention_575 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6398742Z triton_flex_attention_574 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6399898Z triton_flex_attention_572 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.6401048Z triton_flex_attention_573 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.6402159Z triton_flex_attention_570 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6403309Z triton_flex_attention_571 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6403551Z SingleProcess AUTOTUNE benchmarking takes 0.2930 seconds and 1.2848 seconds precompiling for 6 choices 2025-12-04T09:14:59.6403624Z Autotune Choices Stats: 2025-12-04T09:14:59.6405049Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_579", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.014271999709308147, "best_triton_pos": 0} 2025-12-04T09:14:59.6405483Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6405842Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6406410Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6407580Z triton_flex_attention_backward_579 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6408769Z triton_flex_attention_backward_577 0.0143 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6409954Z triton_flex_attention_backward_578 0.0143 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6411139Z triton_flex_attention_backward_576 0.0145 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6412336Z triton_flex_attention_backward_580 0.0154 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6413507Z triton_flex_attention_backward_581 0.0154 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6414664Z triton_flex_attention_backward_583 0.0155 ms 92.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6415845Z triton_flex_attention_backward_582 0.0164 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6417048Z triton_flex_attention_backward_585 0.0164 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6418236Z triton_flex_attention_backward_588 0.0164 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.6418478Z SingleProcess AUTOTUNE benchmarking takes 0.6683 seconds and 2.2637 seconds precompiling for 13 choices 2025-12-04T09:14:59.6418607Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.6418728Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.6418791Z unimplemented [] 2025-12-04T09:14:59.6418897Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.6419084Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.6420269Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.6420336Z graph_break [] 2025-12-04T09:14:59.6420460Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.6420535Z Autotune Choices Stats: 2025-12-04T09:14:59.6421918Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.6422164Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6422381Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6422733Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6423868Z triton_flex_attention_593 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6425012Z triton_flex_attention_594 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6426165Z triton_flex_attention_589 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6427285Z triton_flex_attention_591 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.6428431Z triton_flex_attention_592 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.6429551Z triton_flex_attention_590 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6429797Z SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.3283 seconds precompiling for 6 choices 2025-12-04T09:14:59.6429869Z Autotune Choices Stats: 2025-12-04T09:14:59.6431294Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_595", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:14:59.6431767Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6432098Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6432662Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6433858Z triton_flex_attention_backward_595 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6435051Z triton_flex_attention_backward_596 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6436207Z triton_flex_attention_backward_597 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6437394Z triton_flex_attention_backward_598 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6438559Z triton_flex_attention_backward_600 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6439711Z triton_flex_attention_backward_604 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6440905Z triton_flex_attention_backward_599 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6442058Z triton_flex_attention_backward_601 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6443266Z triton_flex_attention_backward_602 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6444452Z triton_flex_attention_backward_607 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.6444737Z SingleProcess AUTOTUNE benchmarking takes 0.6683 seconds and 2.3815 seconds precompiling for 13 choices 2025-12-04T09:14:59.6444864Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.6444940Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.6445002Z unimplemented [] 2025-12-04T09:14:59.6445103Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.6445293Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.6446474Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.6446542Z graph_break [] 2025-12-04T09:14:59.6446667Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.6446732Z Autotune Choices Stats: 2025-12-04T09:14:59.6448126Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_612", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.6448434Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6448651Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6448965Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6450104Z triton_flex_attention_612 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6451244Z triton_flex_attention_613 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6452439Z triton_flex_attention_610 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.6453603Z triton_flex_attention_608 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6454713Z triton_flex_attention_611 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.6455829Z triton_flex_attention_609 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6456070Z SingleProcess AUTOTUNE benchmarking takes 0.2948 seconds and 1.3336 seconds precompiling for 6 choices 2025-12-04T09:14:59.6456143Z Autotune Choices Stats: 2025-12-04T09:14:59.6457566Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_614", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:14:59.6458037Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6458365Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6458961Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6460209Z triton_flex_attention_backward_614 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6461373Z triton_flex_attention_backward_615 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6462729Z triton_flex_attention_backward_616 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6463898Z triton_flex_attention_backward_617 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6465057Z triton_flex_attention_backward_619 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6466213Z triton_flex_attention_backward_620 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6467423Z triton_flex_attention_backward_621 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6468612Z triton_flex_attention_backward_618 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6469819Z triton_flex_attention_backward_623 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6471019Z triton_flex_attention_backward_626 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.6471270Z SingleProcess AUTOTUNE benchmarking takes 0.6671 seconds and 2.3094 seconds precompiling for 13 choices 2025-12-04T09:14:59.6471400Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.6471475Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.6471536Z unimplemented [] 2025-12-04T09:14:59.6471639Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.6471831Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.6473007Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.6473074Z graph_break [] 2025-12-04T09:14:59.6473201Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.6473265Z Autotune Choices Stats: 2025-12-04T09:14:59.6474662Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_631", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:59.6474943Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6475170Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6475488Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6476652Z triton_flex_attention_631 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6477801Z triton_flex_attention_632 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6478919Z triton_flex_attention_627 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6480086Z triton_flex_attention_629 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.6481199Z triton_flex_attention_630 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.6482319Z triton_flex_attention_628 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6482595Z SingleProcess AUTOTUNE benchmarking takes 0.2939 seconds and 1.3035 seconds precompiling for 6 choices 2025-12-04T09:14:59.6482670Z Autotune Choices Stats: 2025-12-04T09:14:59.6484105Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_634", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:59.6484541Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6484900Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6485479Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6486684Z triton_flex_attention_backward_634 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6487880Z triton_flex_attention_backward_635 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6489034Z triton_flex_attention_backward_636 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6490200Z triton_flex_attention_backward_633 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6491344Z triton_flex_attention_backward_638 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6492579Z triton_flex_attention_backward_640 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6493768Z triton_flex_attention_backward_637 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6494951Z triton_flex_attention_backward_639 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6496118Z triton_flex_attention_backward_642 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6497298Z triton_flex_attention_backward_645 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.6497550Z SingleProcess AUTOTUNE benchmarking takes 0.6697 seconds and 2.2461 seconds precompiling for 13 choices 2025-12-04T09:14:59.6497685Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.6497762Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.6497825Z unimplemented [] 2025-12-04T09:14:59.6497928Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.6498117Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.6499301Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.6499404Z graph_break [] 2025-12-04T09:14:59.6499532Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.6499599Z Autotune Choices Stats: 2025-12-04T09:14:59.6500990Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_650", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:59.6501230Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6501458Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6501805Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6502967Z triton_flex_attention_650 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6504085Z triton_flex_attention_651 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6505241Z triton_flex_attention_648 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.6506356Z triton_flex_attention_646 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6507470Z triton_flex_attention_649 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.6508590Z triton_flex_attention_647 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6508865Z SingleProcess AUTOTUNE benchmarking takes 0.2938 seconds and 1.3188 seconds precompiling for 6 choices 2025-12-04T09:14:59.6508938Z Autotune Choices Stats: 2025-12-04T09:14:59.6510409Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_654", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:59.6510855Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6511227Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6511789Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6512953Z triton_flex_attention_backward_654 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6514141Z triton_flex_attention_backward_652 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6515289Z triton_flex_attention_backward_653 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6516440Z triton_flex_attention_backward_655 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6517623Z triton_flex_attention_backward_657 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6518800Z triton_flex_attention_backward_659 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6519983Z triton_flex_attention_backward_658 0.0156 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6521129Z triton_flex_attention_backward_656 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6522315Z triton_flex_attention_backward_661 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6523461Z triton_flex_attention_backward_664 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.6523709Z SingleProcess AUTOTUNE benchmarking takes 0.6665 seconds and 2.3325 seconds precompiling for 13 choices 2025-12-04T09:14:59.6523836Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.6523907Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.6523980Z unimplemented [] 2025-12-04T09:14:59.6524085Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.6524274Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.6525453Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.6525554Z graph_break [] 2025-12-04T09:14:59.6525681Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.6525745Z Autotune Choices Stats: 2025-12-04T09:14:59.6527171Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_669", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:59.6527413Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6527637Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6527950Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6529124Z triton_flex_attention_669 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6530271Z triton_flex_attention_670 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6531391Z triton_flex_attention_667 0.0123 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.6532549Z triton_flex_attention_665 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6533669Z triton_flex_attention_668 0.0135 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.6534826Z triton_flex_attention_666 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6535073Z SingleProcess AUTOTUNE benchmarking takes 0.2949 seconds and 1.4095 seconds precompiling for 6 choices 2025-12-04T09:14:59.6535143Z Autotune Choices Stats: 2025-12-04T09:14:59.6536620Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_673", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012512000277638435, "best_triton_pos": 0} 2025-12-04T09:14:59.6537091Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6537421Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6538014Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6539183Z triton_flex_attention_backward_673 0.0125 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6540350Z triton_flex_attention_backward_672 0.0133 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6541503Z triton_flex_attention_backward_674 0.0133 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6542698Z triton_flex_attention_backward_671 0.0143 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6543853Z triton_flex_attention_backward_676 0.0143 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6545058Z triton_flex_attention_backward_675 0.0152 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6546241Z triton_flex_attention_backward_677 0.0154 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6547419Z triton_flex_attention_backward_678 0.0154 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6548573Z triton_flex_attention_backward_680 0.0164 ms 76.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6549725Z triton_flex_attention_backward_683 0.0164 ms 76.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.6549973Z SingleProcess AUTOTUNE benchmarking takes 0.6665 seconds and 2.3098 seconds precompiling for 13 choices 2025-12-04T09:14:59.6550101Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.6550216Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.6550280Z unimplemented [] 2025-12-04T09:14:59.6550381Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.6550573Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.6555757Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.6555870Z graph_break [] 2025-12-04T09:14:59.6556027Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.6556105Z Autotune Choices Stats: 2025-12-04T09:14:59.6557617Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_688", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:59.6557923Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6558159Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6558485Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6559680Z triton_flex_attention_688 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6560806Z triton_flex_attention_689 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6562142Z triton_flex_attention_686 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.6563274Z triton_flex_attention_684 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6564490Z triton_flex_attention_687 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.6565614Z triton_flex_attention_685 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6565913Z SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.3382 seconds precompiling for 6 choices 2025-12-04T09:14:59.6565987Z Autotune Choices Stats: 2025-12-04T09:14:59.6567477Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_693", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:59.6567926Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6568311Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6568876Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6570077Z triton_flex_attention_backward_693 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6571231Z triton_flex_attention_backward_692 0.0135 ms 98.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6572429Z triton_flex_attention_backward_691 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6573641Z triton_flex_attention_backward_695 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6574823Z triton_flex_attention_backward_690 0.0153 ms 86.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6576012Z triton_flex_attention_backward_696 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6577170Z triton_flex_attention_backward_697 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6578354Z triton_flex_attention_backward_699 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6579510Z triton_flex_attention_backward_694 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6580657Z triton_flex_attention_backward_702 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.6580967Z SingleProcess AUTOTUNE benchmarking takes 0.6670 seconds and 2.3646 seconds precompiling for 13 choices 2025-12-04T09:14:59.6581104Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.6581181Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.6581246Z unimplemented [] 2025-12-04T09:14:59.6581363Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.6581556Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.6582727Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.6582833Z graph_break [] 2025-12-04T09:14:59.6582965Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.6583033Z Autotune Choices Stats: 2025-12-04T09:14:59.6584463Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_707", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:59.6584712Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6584968Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6585285Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6586417Z triton_flex_attention_707 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6587527Z triton_flex_attention_708 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6588645Z triton_flex_attention_703 0.0134 ms 76.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6589787Z triton_flex_attention_705 0.0135 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.6590909Z triton_flex_attention_706 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.6592061Z triton_flex_attention_704 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6592306Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3640 seconds precompiling for 6 choices 2025-12-04T09:14:59.6592405Z Autotune Choices Stats: 2025-12-04T09:14:59.6593840Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_712", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:59.6594309Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6594640Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6595201Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6596370Z triton_flex_attention_backward_712 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6597530Z triton_flex_attention_backward_710 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6598724Z triton_flex_attention_backward_711 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6599947Z triton_flex_attention_backward_709 0.0153 ms 86.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6601139Z triton_flex_attention_backward_714 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6602291Z triton_flex_attention_backward_716 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6603466Z triton_flex_attention_backward_713 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6604620Z triton_flex_attention_backward_715 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6605775Z triton_flex_attention_backward_718 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6606955Z triton_flex_attention_backward_721 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.6607208Z SingleProcess AUTOTUNE benchmarking takes 0.6684 seconds and 2.3297 seconds precompiling for 13 choices 2025-12-04T09:14:59.6607338Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.6607417Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.6607481Z unimplemented [] 2025-12-04T09:14:59.6607584Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.6607779Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.6609066Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.6609140Z graph_break [] 2025-12-04T09:14:59.6609275Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.6609378Z Autotune Choices Stats: 2025-12-04T09:14:59.6610784Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_726", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.6611083Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6611309Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6611629Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6612834Z triton_flex_attention_726 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6613958Z triton_flex_attention_727 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6615078Z triton_flex_attention_724 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.6616227Z triton_flex_attention_722 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6617380Z triton_flex_attention_725 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.6618538Z triton_flex_attention_723 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6618785Z SingleProcess AUTOTUNE benchmarking takes 0.2946 seconds and 1.3261 seconds precompiling for 6 choices 2025-12-04T09:14:59.6618886Z Autotune Choices Stats: 2025-12-04T09:14:59.6620331Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_730", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:59.6620768Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6621105Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6621666Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6622832Z triton_flex_attention_backward_730 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6624223Z triton_flex_attention_backward_729 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6625383Z triton_flex_attention_backward_731 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6626583Z triton_flex_attention_backward_728 0.0152 ms 87.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6627773Z triton_flex_attention_backward_733 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6628959Z triton_flex_attention_backward_732 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6630105Z triton_flex_attention_backward_734 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6631260Z triton_flex_attention_backward_735 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6632408Z triton_flex_attention_backward_737 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6633587Z triton_flex_attention_backward_736 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6633841Z SingleProcess AUTOTUNE benchmarking takes 0.6686 seconds and 2.3200 seconds precompiling for 13 choices 2025-12-04T09:14:59.6633974Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.6634046Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.6634150Z unimplemented [] 2025-12-04T09:14:59.6634259Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.6634446Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.6635651Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.6635720Z graph_break [] 2025-12-04T09:14:59.6635849Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.6635952Z Autotune Choices Stats: 2025-12-04T09:14:59.6637345Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_745", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.6637587Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6637811Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6638131Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6639262Z triton_flex_attention_745 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6640381Z triton_flex_attention_746 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6641537Z triton_flex_attention_743 0.0142 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.6642677Z triton_flex_attention_741 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6643878Z triton_flex_attention_744 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.6644992Z triton_flex_attention_742 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6645271Z SingleProcess AUTOTUNE benchmarking takes 0.2945 seconds and 1.2852 seconds precompiling for 6 choices 2025-12-04T09:14:59.6645338Z Autotune Choices Stats: 2025-12-04T09:14:59.6646764Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_748", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:59.6647200Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6647528Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6648091Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6649273Z triton_flex_attention_backward_748 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6650464Z triton_flex_attention_backward_749 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6651648Z triton_flex_attention_backward_747 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6652910Z triton_flex_attention_backward_750 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6654102Z triton_flex_attention_backward_752 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6655261Z triton_flex_attention_backward_751 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6656408Z triton_flex_attention_backward_753 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6657564Z triton_flex_attention_backward_754 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6658747Z triton_flex_attention_backward_756 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6659933Z triton_flex_attention_backward_759 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.6660186Z SingleProcess AUTOTUNE benchmarking takes 0.6747 seconds and 2.3463 seconds precompiling for 13 choices 2025-12-04T09:14:59.6660359Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:14:59.6660437Z Traceback (most recent call last): 2025-12-04T09:14:59.6660744Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:14:59.6660848Z self.assertTrue( 2025-12-04T09:14:59.6661055Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:14:59.6661141Z raise self.failureException(msg) 2025-12-04T09:14:59.6661396Z AssertionError: False is not true : Log file /tmp/tmpup5bjcux/flex_attention_configs.json was not created 2025-12-04T09:14:59.6661438Z 2025-12-04T09:14:59.6661581Z To execute this test, run the following from the base repo dir: 2025-12-04T09:14:59.6661982Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:14:59.6661987Z 2025-12-04T09:14:59.6662167Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:14:59.6662303Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.6662374Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.6662443Z unimplemented [] 2025-12-04T09:14:59.6662549Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.6663736Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('async_compile_cache_miss', 23), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:14:59.6663931Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.6663991Z graph_break [] 2025-12-04T09:14:59.6664128Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.6667128Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:14:59.6667250Z current_size = base.storage().size() 2025-12-04T09:14:59.6667419Z Autotune Choices Stats: 2025-12-04T09:14:59.6668852Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:59.6669111Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6669370Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6669695Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6670901Z triton_flex_attention_4 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6672027Z triton_flex_attention_5 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6673190Z triton_flex_attention_0 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6674308Z triton_flex_attention_2 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.6675416Z triton_flex_attention_3 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.6676597Z triton_flex_attention_1 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6676906Z SingleProcess AUTOTUNE benchmarking takes 0.2600 seconds and 1.4244 seconds precompiling for 6 choices 2025-12-04T09:14:59.6676980Z Autotune Choices Stats: 2025-12-04T09:14:59.6678409Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:59.6678884Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6679219Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6679782Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6680961Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6682152Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6683293Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6684485Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6685630Z triton_flex_attention_backward_11 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6686809Z triton_flex_attention_backward_12 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6687989Z triton_flex_attention_backward_13 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6689137Z triton_flex_attention_backward_10 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6690315Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6691455Z triton_flex_attention_backward_15 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6691712Z SingleProcess AUTOTUNE benchmarking takes 0.8717 seconds and 2.2315 seconds precompiling for 13 choices 2025-12-04T09:14:59.6691849Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.6691926Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.6691989Z unimplemented [] 2025-12-04T09:14:59.6692094Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.6692287Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.6693561Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.6693665Z graph_break [] 2025-12-04T09:14:59.6693796Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.6693863Z Autotune Choices Stats: 2025-12-04T09:14:59.6695253Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.6695508Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6695774Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6696092Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6697225Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6698334Z triton_flex_attention_24 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6699490Z triton_flex_attention_19 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6700599Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.6701760Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.6702906Z triton_flex_attention_20 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6703151Z SingleProcess AUTOTUNE benchmarking takes 0.2898 seconds and 1.2778 seconds precompiling for 6 choices 2025-12-04T09:14:59.6703227Z Autotune Choices Stats: 2025-12-04T09:14:59.6704681Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.6705123Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6705448Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6706008Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6707204Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6708360Z triton_flex_attention_backward_27 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6709505Z triton_flex_attention_backward_28 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6710706Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6711894Z triton_flex_attention_backward_30 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6713079Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6714228Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6715376Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6716559Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6717702Z triton_flex_attention_backward_37 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.6717954Z SingleProcess AUTOTUNE benchmarking takes 0.6605 seconds and 2.3199 seconds precompiling for 13 choices 2025-12-04T09:14:59.6718082Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.6718159Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.6718223Z unimplemented [] 2025-12-04T09:14:59.6718330Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.6718555Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.6719771Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.6719840Z graph_break [] 2025-12-04T09:14:59.6719967Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.6720035Z Autotune Choices Stats: 2025-12-04T09:14:59.6721452Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.6721701Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6721925Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6722242Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6723372Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6724512Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6725636Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.6726745Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.6727889Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6729043Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6729282Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2616 seconds precompiling for 6 choices 2025-12-04T09:14:59.6729351Z Autotune Choices Stats: 2025-12-04T09:14:59.6730821Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.6731261Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6731589Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6732177Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6733450Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6734614Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6735844Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6737028Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6738170Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6739349Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6740495Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6741688Z triton_flex_attention_backward_50 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6742830Z triton_flex_attention_backward_53 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6743973Z triton_flex_attention_backward_56 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.6744256Z SingleProcess AUTOTUNE benchmarking takes 0.6631 seconds and 2.2635 seconds precompiling for 13 choices 2025-12-04T09:14:59.6744414Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.6744489Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.6744550Z unimplemented [] 2025-12-04T09:14:59.6744655Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.6744849Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.6746027Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.6746091Z graph_break [] 2025-12-04T09:14:59.6746221Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.6746287Z Autotune Choices Stats: 2025-12-04T09:14:59.6747708Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.6747950Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6748173Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6748521Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6749649Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6750758Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6751877Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.6753019Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.6754158Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6755301Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6755546Z SingleProcess AUTOTUNE benchmarking takes 0.2907 seconds and 1.2500 seconds precompiling for 6 choices 2025-12-04T09:14:59.6755621Z Autotune Choices Stats: 2025-12-04T09:14:59.6757050Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.6757517Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6757848Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6758408Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6759585Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6760777Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6762163Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6763331Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6764551Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6765710Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6766912Z triton_flex_attention_backward_69 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6768060Z triton_flex_attention_backward_70 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6769222Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6770419Z triton_flex_attention_backward_75 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.6770720Z SingleProcess AUTOTUNE benchmarking takes 0.6601 seconds and 2.1469 seconds precompiling for 13 choices 2025-12-04T09:14:59.6770850Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.6770924Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.6770986Z unimplemented [] 2025-12-04T09:14:59.6771086Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.6771272Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.6772549Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.6772623Z graph_break [] 2025-12-04T09:14:59.6772753Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.6772818Z Autotune Choices Stats: 2025-12-04T09:14:59.6774202Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.6774481Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6774704Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6775018Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6776147Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6777261Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6778416Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.6779569Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.6780683Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6781835Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6782075Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2576 seconds precompiling for 6 choices 2025-12-04T09:14:59.6782139Z Autotune Choices Stats: 2025-12-04T09:14:59.6783562Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_84", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.6784027Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6784371Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6784929Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6786093Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6787282Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6788458Z triton_flex_attention_backward_83 0.0122 ms 92.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6789641Z triton_flex_attention_backward_82 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6790781Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6791929Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6793106Z triton_flex_attention_backward_88 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6794251Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6795436Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6796609Z triton_flex_attention_backward_91 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6796860Z SingleProcess AUTOTUNE benchmarking takes 0.6599 seconds and 2.2799 seconds precompiling for 13 choices 2025-12-04T09:14:59.6796986Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.6797054Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.6797138Z unimplemented [] 2025-12-04T09:14:59.6797240Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.6797428Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.6798629Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.6798697Z graph_break [] 2025-12-04T09:14:59.6798821Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.6798885Z Autotune Choices Stats: 2025-12-04T09:14:59.6800284Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:59.6800561Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6800786Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6801106Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6802256Z triton_flex_attention_99 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6803409Z triton_flex_attention_100 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6804577Z triton_flex_attention_95 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6805704Z triton_flex_attention_98 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.6806844Z triton_flex_attention_97 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.6807965Z triton_flex_attention_96 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6808208Z SingleProcess AUTOTUNE benchmarking takes 0.2905 seconds and 1.2994 seconds precompiling for 6 choices 2025-12-04T09:14:59.6808318Z Autotune Choices Stats: 2025-12-04T09:14:59.6809749Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_102", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.6810187Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6810517Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6811086Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6812345Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6813558Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6814743Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6815902Z triton_flex_attention_backward_101 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6817049Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6818233Z triton_flex_attention_backward_107 0.0133 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6819383Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6820527Z triton_flex_attention_backward_108 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6821703Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6822878Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6823130Z SingleProcess AUTOTUNE benchmarking takes 0.6615 seconds and 2.2434 seconds precompiling for 13 choices 2025-12-04T09:14:59.6823262Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.6823367Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.6823431Z unimplemented [] 2025-12-04T09:14:59.6823534Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.6823722Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.6824895Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.6824961Z graph_break [] 2025-12-04T09:14:59.6825123Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.6825190Z Autotune Choices Stats: 2025-12-04T09:14:59.6826593Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.6826839Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6827060Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6827375Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6828503Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6829664Z triton_flex_attention_119 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6830820Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.6831971Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.6833081Z triton_flex_attention_114 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6834198Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6834473Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2587 seconds precompiling for 6 choices 2025-12-04T09:14:59.6834543Z Autotune Choices Stats: 2025-12-04T09:14:59.6835967Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:59.6836406Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6836730Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6837343Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6838534Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6839694Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6840878Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6842032Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6843212Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6844361Z triton_flex_attention_backward_124 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6845513Z triton_flex_attention_backward_126 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6846690Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6847871Z triton_flex_attention_backward_129 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6849043Z triton_flex_attention_backward_132 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.6849300Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.3330 seconds precompiling for 13 choices 2025-12-04T09:14:59.6849426Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.6849498Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.6849559Z unimplemented [] 2025-12-04T09:14:59.6849661Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.6849853Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.6851029Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.6851130Z graph_break [] 2025-12-04T09:14:59.6851267Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.6851336Z Autotune Choices Stats: 2025-12-04T09:14:59.6852788Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:59.6853035Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6853254Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6853567Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6854730Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6855874Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6857027Z triton_flex_attention_136 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.6858142Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.6859254Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6860402Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6860642Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3488 seconds precompiling for 6 choices 2025-12-04T09:14:59.6860715Z Autotune Choices Stats: 2025-12-04T09:14:59.6862311Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_140", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.6862761Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6863147Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6863753Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6864919Z triton_flex_attention_backward_140 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6866125Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6867276Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6868429Z triton_flex_attention_backward_142 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6869685Z triton_flex_attention_backward_144 0.0131 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6870837Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6872021Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6873198Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6874352Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6875531Z triton_flex_attention_backward_149 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6875780Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.2392 seconds precompiling for 13 choices 2025-12-04T09:14:59.6875910Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.6875984Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.6876045Z unimplemented [] 2025-12-04T09:14:59.6876179Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.6876371Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.6877546Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.6877611Z graph_break [] 2025-12-04T09:14:59.6877739Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.6877807Z Autotune Choices Stats: 2025-12-04T09:14:59.6879197Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.6879438Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6879701Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6880019Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6881183Z triton_flex_attention_156 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6882289Z triton_flex_attention_157 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6883434Z triton_flex_attention_152 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6884550Z triton_flex_attention_154 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.6885696Z triton_flex_attention_155 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.6886811Z triton_flex_attention_153 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6887054Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3309 seconds precompiling for 6 choices 2025-12-04T09:14:59.6887119Z Autotune Choices Stats: 2025-12-04T09:14:59.6888581Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_160", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.6889052Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6889384Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6889939Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6891136Z triton_flex_attention_backward_160 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6892370Z triton_flex_attention_backward_158 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6893533Z triton_flex_attention_backward_159 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6894725Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6895873Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6897058Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6898242Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6899388Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6900590Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6901744Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.6902028Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.1546 seconds precompiling for 13 choices 2025-12-04T09:14:59.6902163Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.6902240Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.6902304Z unimplemented [] 2025-12-04T09:14:59.6902410Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.6902601Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.6903779Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.6903847Z graph_break [] 2025-12-04T09:14:59.6903980Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.6904046Z Autotune Choices Stats: 2025-12-04T09:14:59.6905465Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.6905741Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6905970Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6906290Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6907418Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6908562Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6909679Z triton_flex_attention_171 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6910795Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.6911949Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.6913069Z triton_flex_attention_172 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6913314Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3199 seconds precompiling for 6 choices 2025-12-04T09:14:59.6913380Z Autotune Choices Stats: 2025-12-04T09:14:59.6914848Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_178", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.6915311Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6915638Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6916200Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6917403Z triton_flex_attention_backward_178 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6918564Z triton_flex_attention_backward_177 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6919751Z triton_flex_attention_backward_180 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6920922Z triton_flex_attention_backward_179 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6922073Z triton_flex_attention_backward_182 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6923258Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6924435Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6925622Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6926779Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6927925Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.6928209Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3449 seconds precompiling for 13 choices 2025-12-04T09:14:59.6928338Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.6928418Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.6928482Z unimplemented [] 2025-12-04T09:14:59.6928585Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.6928777Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.6929959Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.6930027Z graph_break [] 2025-12-04T09:14:59.6930155Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.6930223Z Autotune Choices Stats: 2025-12-04T09:14:59.6931653Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:59.6931943Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6932163Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6932515Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6933700Z triton_flex_attention_194 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6934819Z triton_flex_attention_195 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6935939Z triton_flex_attention_192 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.6937088Z triton_flex_attention_193 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.6938207Z triton_flex_attention_190 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6939334Z triton_flex_attention_191 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6939607Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3094 seconds precompiling for 6 choices 2025-12-04T09:14:59.6939706Z Autotune Choices Stats: 2025-12-04T09:14:59.6941136Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.6941569Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6941936Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6942494Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6943660Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6944848Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6946012Z triton_flex_attention_backward_197 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6947172Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6948367Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6949555Z triton_flex_attention_backward_200 0.0124 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6950733Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6951885Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6953035Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.6954210Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6954463Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.3930 seconds precompiling for 13 choices 2025-12-04T09:14:59.6954593Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.6954663Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.6954741Z unimplemented [] 2025-12-04T09:14:59.6954845Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.6955033Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.6956202Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.6956300Z graph_break [] 2025-12-04T09:14:59.6956458Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.6956525Z Autotune Choices Stats: 2025-12-04T09:14:59.6957906Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008063999935984612, "best_triton_pos": 0} 2025-12-04T09:14:59.6958144Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6958365Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6958713Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6959841Z triton_flex_attention_213 0.0081 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6960952Z triton_flex_attention_214 0.0082 ms 98.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6962277Z triton_flex_attention_211 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.6963402Z triton_flex_attention_212 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.6964524Z triton_flex_attention_209 0.0112 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6965721Z triton_flex_attention_210 0.0133 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6966017Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.3148 seconds precompiling for 6 choices 2025-12-04T09:14:59.6966084Z Autotune Choices Stats: 2025-12-04T09:14:59.6967526Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.6968005Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6968335Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6968890Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6970072Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6971279Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6972494Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6973698Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6974877Z triton_flex_attention_backward_219 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6976029Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6977208Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6978370Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.6979552Z triton_flex_attention_backward_224 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6980703Z triton_flex_attention_backward_227 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.6980954Z SingleProcess AUTOTUNE benchmarking takes 0.6619 seconds and 2.3746 seconds precompiling for 13 choices 2025-12-04T09:14:59.6981082Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.6981149Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.6981217Z unimplemented [] 2025-12-04T09:14:59.6981318Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.6981507Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.6982712Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.6982811Z graph_break [] 2025-12-04T09:14:59.6982936Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.6983005Z Autotune Choices Stats: 2025-12-04T09:14:59.6984395Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.6984675Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6984897Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6985212Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6986345Z triton_flex_attention_232 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6987494Z triton_flex_attention_233 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6988620Z triton_flex_attention_228 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6989752Z triton_flex_attention_230 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.6990902Z triton_flex_attention_231 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.6992048Z triton_flex_attention_229 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6992300Z SingleProcess AUTOTUNE benchmarking takes 0.2900 seconds and 1.2906 seconds precompiling for 6 choices 2025-12-04T09:14:59.6992365Z Autotune Choices Stats: 2025-12-04T09:14:59.6993840Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:59.6994282Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.6994616Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.6995174Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.6996391Z triton_flex_attention_backward_234 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.6997549Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.6998708Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.6999912Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7001098Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7002332Z triton_flex_attention_backward_240 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7003487Z triton_flex_attention_backward_241 0.0134 ms 91.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7004645Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7005834Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7006987Z triton_flex_attention_backward_246 0.0153 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.7007234Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2768 seconds precompiling for 13 choices 2025-12-04T09:14:59.7007360Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.7007427Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.7007533Z unimplemented [] 2025-12-04T09:14:59.7007672Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.7007860Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.7009037Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.7009103Z graph_break [] 2025-12-04T09:14:59.7009236Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.7009301Z Autotune Choices Stats: 2025-12-04T09:14:59.7010720Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.7010962Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7011181Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7011494Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7012664Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7013812Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7014932Z triton_flex_attention_250 0.0111 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.7016076Z triton_flex_attention_247 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7017232Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.7018343Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7018593Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2952 seconds precompiling for 6 choices 2025-12-04T09:14:59.7018690Z Autotune Choices Stats: 2025-12-04T09:14:59.7020135Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_255", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.7020573Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7020936Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7021496Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7022669Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7023822Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7025013Z triton_flex_attention_backward_254 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7026198Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7027394Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7028549Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7029705Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7030889Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7032044Z triton_flex_attention_backward_262 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7033225Z triton_flex_attention_backward_265 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.7033504Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.3139 seconds precompiling for 13 choices 2025-12-04T09:14:59.7033635Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.7033703Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.7033772Z unimplemented [] 2025-12-04T09:14:59.7033872Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.7034054Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.7035237Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.7035299Z graph_break [] 2025-12-04T09:14:59.7035431Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.7035535Z Autotune Choices Stats: 2025-12-04T09:14:59.7036928Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.7037172Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7037395Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7037741Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7038875Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7040002Z triton_flex_attention_271 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7041120Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7042268Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.7043431Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.7044572Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7044823Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T09:14:59.7044888Z Autotune Choices Stats: 2025-12-04T09:14:59.7046322Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:59.7046785Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7047111Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7047670Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7048847Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7050033Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7051239Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7052441Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7053640Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7054799Z triton_flex_attention_backward_278 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7055980Z triton_flex_attention_backward_279 0.0136 ms 90.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7057135Z triton_flex_attention_backward_276 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7058291Z triton_flex_attention_backward_281 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7059477Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.7059778Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.2940 seconds precompiling for 13 choices 2025-12-04T09:14:59.7059907Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.7059976Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.7060047Z unimplemented [] 2025-12-04T09:14:59.7060147Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.7060330Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.7061542Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.7061609Z graph_break [] 2025-12-04T09:14:59.7061743Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.7061966Z Autotune Choices Stats: 2025-12-04T09:14:59.7063385Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.7063696Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7063918Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7064235Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7065373Z triton_flex_attention_289 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7066495Z triton_flex_attention_290 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7067826Z triton_flex_attention_287 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.7068983Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7070148Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.7071259Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7071523Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3003 seconds precompiling for 6 choices 2025-12-04T09:14:59.7071592Z Autotune Choices Stats: 2025-12-04T09:14:59.7073023Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_294", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.012160000391304493, "best_triton_pos": 0} 2025-12-04T09:14:59.7073490Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7073821Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7074383Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7075580Z triton_flex_attention_backward_294 0.0122 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7076775Z triton_flex_attention_backward_291 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7077930Z triton_flex_attention_backward_292 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7079117Z triton_flex_attention_backward_293 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7080270Z triton_flex_attention_backward_296 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7081457Z triton_flex_attention_backward_295 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7082601Z triton_flex_attention_backward_297 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7083754Z triton_flex_attention_backward_298 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7084946Z triton_flex_attention_backward_300 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7086126Z triton_flex_attention_backward_303 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.7086377Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.2523 seconds precompiling for 13 choices 2025-12-04T09:14:59.7086508Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.7086578Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.7086648Z unimplemented [] 2025-12-04T09:14:59.7086749Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.7086965Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.7088142Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.7088202Z graph_break [] 2025-12-04T09:14:59.7088345Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.7088416Z Autotune Choices Stats: 2025-12-04T09:14:59.7089824Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_308", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.7090100Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7090321Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7090638Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7091755Z triton_flex_attention_308 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7092965Z triton_flex_attention_309 0.0082 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7094121Z triton_flex_attention_307 0.0112 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.7095232Z triton_flex_attention_304 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7096381Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.7097493Z triton_flex_attention_305 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7097788Z SingleProcess AUTOTUNE benchmarking takes 0.2924 seconds and 1.4096 seconds precompiling for 6 choices 2025-12-04T09:14:59.7097856Z Autotune Choices Stats: 2025-12-04T09:14:59.7099300Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.7099736Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7100065Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7100627Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7101831Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7103028Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7104213Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7105377Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7106522Z triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7107709Z triton_flex_attention_backward_316 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7108857Z triton_flex_attention_backward_317 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7110046Z triton_flex_attention_backward_314 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7111220Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7112372Z triton_flex_attention_backward_318 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7112656Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.3225 seconds precompiling for 13 choices 2025-12-04T09:14:59.7112785Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.7112854Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.7112922Z unimplemented [] 2025-12-04T09:14:59.7113022Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.7113206Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.7114384Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.7114498Z graph_break [] 2025-12-04T09:14:59.7114636Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.7114702Z Autotune Choices Stats: 2025-12-04T09:14:59.7116100Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.007968000136315823, "best_triton_pos": 0} 2025-12-04T09:14:59.7116342Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7116566Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7116882Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7118043Z triton_flex_attention_327 0.0080 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7119190Z triton_flex_attention_328 0.0082 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7120305Z triton_flex_attention_325 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.7121449Z triton_flex_attention_326 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.7122562Z triton_flex_attention_323 0.0113 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7123724Z triton_flex_attention_324 0.0133 ms 59.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7123982Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T09:14:59.7124051Z Autotune Choices Stats: 2025-12-04T09:14:59.7125483Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.7125918Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7126278Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7126834Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7128033Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7129219Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7130382Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7131535Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7132774Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7133929Z triton_flex_attention_backward_336 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7135087Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7136276Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7137507Z triton_flex_attention_backward_338 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7138695Z triton_flex_attention_backward_341 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.7138946Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3668 seconds precompiling for 13 choices 2025-12-04T09:14:59.7139073Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.7139139Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.7139205Z unimplemented [] 2025-12-04T09:14:59.7139313Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.7139495Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.7140710Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.7140769Z graph_break [] 2025-12-04T09:14:59.7140902Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.7140967Z Autotune Choices Stats: 2025-12-04T09:14:59.7142360Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.7142601Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7142814Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7143131Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7144289Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7145439Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7146593Z triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7147712Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.7148828Z triton_flex_attention_345 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.7149968Z triton_flex_attention_343 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7150216Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2820 seconds precompiling for 6 choices 2025-12-04T09:14:59.7150282Z Autotune Choices Stats: 2025-12-04T09:14:59.7151708Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_349", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.7152169Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7152532Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7153090Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7154260Z triton_flex_attention_backward_349 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7155464Z triton_flex_attention_backward_348 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7156622Z triton_flex_attention_backward_350 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7157797Z triton_flex_attention_backward_351 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7158946Z triton_flex_attention_backward_353 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7160102Z triton_flex_attention_backward_355 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7161274Z triton_flex_attention_backward_352 0.0143 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7162616Z triton_flex_attention_backward_354 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7163823Z triton_flex_attention_backward_357 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7164982Z triton_flex_attention_backward_360 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.7165234Z SingleProcess AUTOTUNE benchmarking takes 0.6637 seconds and 2.2959 seconds precompiling for 13 choices 2025-12-04T09:14:59.7165365Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.7165479Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.7165547Z unimplemented [] 2025-12-04T09:14:59.7165649Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.7165831Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.7167010Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.7167085Z graph_break [] 2025-12-04T09:14:59.7167218Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.7167287Z Autotune Choices Stats: 2025-12-04T09:14:59.7168680Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.7168971Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7169197Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7169569Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7170693Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7171846Z triton_flex_attention_366 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7173027Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.7174140Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7175292Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.7176406Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7176661Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3270 seconds precompiling for 6 choices 2025-12-04T09:14:59.7176727Z Autotune Choices Stats: 2025-12-04T09:14:59.7178189Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_368", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.7178658Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7178994Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7179557Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7180760Z triton_flex_attention_backward_368 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7181916Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7183079Z triton_flex_attention_backward_369 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7184258Z triton_flex_attention_backward_370 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7185413Z triton_flex_attention_backward_372 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7186599Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7187797Z triton_flex_attention_backward_374 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7188957Z triton_flex_attention_backward_371 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7190143Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7191294Z triton_flex_attention_backward_379 0.0146 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.7191575Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.4637 seconds precompiling for 13 choices 2025-12-04T09:14:59.7191712Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.7191780Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.7191846Z unimplemented [] 2025-12-04T09:14:59.7191949Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.7192133Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.7193315Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.7193377Z graph_break [] 2025-12-04T09:14:59.7193510Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.7193574Z Autotune Choices Stats: 2025-12-04T09:14:59.7194997Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_384", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.7195298Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7195515Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7195837Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7196965Z triton_flex_attention_384 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7198112Z triton_flex_attention_385 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7199241Z triton_flex_attention_380 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7200383Z triton_flex_attention_382 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.7201502Z triton_flex_attention_383 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.7202614Z triton_flex_attention_381 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7202861Z SingleProcess AUTOTUNE benchmarking takes 0.2918 seconds and 1.3242 seconds precompiling for 6 choices 2025-12-04T09:14:59.7202927Z Autotune Choices Stats: 2025-12-04T09:14:59.7204397Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_386", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:59.7204870Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7205201Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7205803Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7206974Z triton_flex_attention_backward_386 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7208130Z triton_flex_attention_backward_387 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7209324Z triton_flex_attention_backward_388 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7210485Z triton_flex_attention_backward_389 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7211676Z triton_flex_attention_backward_391 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7212900Z triton_flex_attention_backward_392 0.0134 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7214055Z triton_flex_attention_backward_390 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7215244Z triton_flex_attention_backward_393 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7216393Z triton_flex_attention_backward_395 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7217584Z triton_flex_attention_backward_394 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7217831Z SingleProcess AUTOTUNE benchmarking takes 0.6645 seconds and 2.2524 seconds precompiling for 13 choices 2025-12-04T09:14:59.7217966Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.7218036Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.7218105Z unimplemented [] 2025-12-04T09:14:59.7218209Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.7218393Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.7219577Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.7219639Z graph_break [] 2025-12-04T09:14:59.7219768Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.7219833Z Autotune Choices Stats: 2025-12-04T09:14:59.7221299Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_403", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.7221587Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7221805Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7222127Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7223286Z triton_flex_attention_403 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7224408Z triton_flex_attention_404 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7225550Z triton_flex_attention_399 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7226666Z triton_flex_attention_401 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.7227782Z triton_flex_attention_402 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.7228926Z triton_flex_attention_400 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7229207Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3374 seconds precompiling for 6 choices 2025-12-04T09:14:59.7229274Z Autotune Choices Stats: 2025-12-04T09:14:59.7230704Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_406", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.7231132Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7231495Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7232052Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7233229Z triton_flex_attention_backward_406 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7234409Z triton_flex_attention_backward_405 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7235562Z triton_flex_attention_backward_407 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7236704Z triton_flex_attention_backward_408 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7237885Z triton_flex_attention_backward_410 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7239067Z triton_flex_attention_backward_409 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7240248Z triton_flex_attention_backward_411 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7241403Z triton_flex_attention_backward_412 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7242552Z triton_flex_attention_backward_413 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7243732Z triton_flex_attention_backward_414 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7243978Z SingleProcess AUTOTUNE benchmarking takes 0.6644 seconds and 2.4031 seconds precompiling for 13 choices 2025-12-04T09:14:59.7244112Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.7244179Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.7244248Z unimplemented [] 2025-12-04T09:14:59.7244350Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.7244532Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.7245736Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.7245832Z graph_break [] 2025-12-04T09:14:59.7245962Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.7246029Z Autotune Choices Stats: 2025-12-04T09:14:59.7247417Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_422", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.7247666Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7247912Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7248248Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7249377Z triton_flex_attention_422 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7250491Z triton_flex_attention_423 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7251657Z triton_flex_attention_420 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.7252817Z triton_flex_attention_421 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.7253968Z triton_flex_attention_418 0.0123 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7255121Z triton_flex_attention_419 0.0142 ms 57.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7255371Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3073 seconds precompiling for 6 choices 2025-12-04T09:14:59.7255436Z Autotune Choices Stats: 2025-12-04T09:14:59.7256901Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_426", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012256000190973282, "best_triton_pos": 0} 2025-12-04T09:14:59.7257340Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7257671Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7258242Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7259459Z triton_flex_attention_backward_426 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7260613Z triton_flex_attention_backward_424 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7261768Z triton_flex_attention_backward_425 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7263132Z triton_flex_attention_backward_427 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7264335Z triton_flex_attention_backward_429 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7265532Z triton_flex_attention_backward_431 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7266682Z triton_flex_attention_backward_430 0.0141 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7267836Z triton_flex_attention_backward_428 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7269026Z triton_flex_attention_backward_433 0.0143 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7270194Z triton_flex_attention_backward_436 0.0144 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.7270442Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.3392 seconds precompiling for 13 choices 2025-12-04T09:14:59.7270577Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.7270645Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.7270707Z unimplemented [] 2025-12-04T09:14:59.7270816Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.7271077Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.7272287Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.7272348Z graph_break [] 2025-12-04T09:14:59.7272480Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.7272549Z Autotune Choices Stats: 2025-12-04T09:14:59.7273969Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_441", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.7274222Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7274436Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7274757Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7275880Z triton_flex_attention_441 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7277037Z triton_flex_attention_442 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7278150Z triton_flex_attention_439 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.7279274Z triton_flex_attention_437 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7280420Z triton_flex_attention_440 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.7281562Z triton_flex_attention_438 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7281811Z SingleProcess AUTOTUNE benchmarking takes 0.2906 seconds and 1.4377 seconds precompiling for 6 choices 2025-12-04T09:14:59.7281880Z Autotune Choices Stats: 2025-12-04T09:14:59.7283359Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_445", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.7283794Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7284124Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7284721Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7285890Z triton_flex_attention_backward_445 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7287043Z triton_flex_attention_backward_446 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7288232Z triton_flex_attention_backward_443 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7289410Z triton_flex_attention_backward_444 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7290564Z triton_flex_attention_backward_447 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7291740Z triton_flex_attention_backward_448 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7292951Z triton_flex_attention_backward_450 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7294148Z triton_flex_attention_backward_449 0.0135 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7295298Z triton_flex_attention_backward_452 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7296449Z triton_flex_attention_backward_455 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.7296694Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.3278 seconds precompiling for 13 choices 2025-12-04T09:14:59.7296862Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.7296964Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.7297030Z unimplemented [] 2025-12-04T09:14:59.7297140Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.7297325Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.7298518Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.7298580Z graph_break [] 2025-12-04T09:14:59.7298717Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.7298786Z Autotune Choices Stats: 2025-12-04T09:14:59.7300216Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_460", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:59.7300466Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7300683Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7301009Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7302167Z triton_flex_attention_460 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7303294Z triton_flex_attention_461 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7304417Z triton_flex_attention_459 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.7305567Z triton_flex_attention_456 0.0114 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7306711Z triton_flex_attention_458 0.0123 ms 75.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.7307856Z triton_flex_attention_457 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7308107Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.4513 seconds precompiling for 6 choices 2025-12-04T09:14:59.7308175Z Autotune Choices Stats: 2025-12-04T09:14:59.7309603Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_463", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.7310086Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7310410Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7310974Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7312136Z triton_flex_attention_backward_463 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7313294Z triton_flex_attention_backward_465 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7314483Z triton_flex_attention_backward_462 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7315686Z triton_flex_attention_backward_464 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7316872Z triton_flex_attention_backward_467 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7318024Z triton_flex_attention_backward_468 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7319173Z triton_flex_attention_backward_469 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7320367Z triton_flex_attention_backward_466 0.0133 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7321522Z triton_flex_attention_backward_471 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7322705Z triton_flex_attention_backward_470 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7322981Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.2662 seconds precompiling for 13 choices 2025-12-04T09:14:59.7323115Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.7323184Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.7323245Z unimplemented [] 2025-12-04T09:14:59.7323354Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.7323535Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.7324750Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.7324818Z graph_break [] 2025-12-04T09:14:59.7324943Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.7325017Z Autotune Choices Stats: 2025-12-04T09:14:59.7326406Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_479", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.007968000136315823, "best_triton_pos": 0} 2025-12-04T09:14:59.7326691Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7326909Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7327227Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7328356Z triton_flex_attention_479 0.0080 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7329469Z triton_flex_attention_480 0.0082 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7330607Z triton_flex_attention_477 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.7331758Z triton_flex_attention_478 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.7332917Z triton_flex_attention_475 0.0113 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7334065Z triton_flex_attention_476 0.0133 ms 59.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7334313Z SingleProcess AUTOTUNE benchmarking takes 0.2917 seconds and 1.3266 seconds precompiling for 6 choices 2025-12-04T09:14:59.7334377Z Autotune Choices Stats: 2025-12-04T09:14:59.7335807Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_481", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.7336283Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7336607Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7337174Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7338338Z triton_flex_attention_backward_481 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7339539Z triton_flex_attention_backward_482 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7340733Z triton_flex_attention_backward_483 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7341930Z triton_flex_attention_backward_484 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7343088Z triton_flex_attention_backward_485 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7344233Z triton_flex_attention_backward_486 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7345421Z triton_flex_attention_backward_487 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7346571Z triton_flex_attention_backward_488 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7347765Z triton_flex_attention_backward_490 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7348951Z triton_flex_attention_backward_493 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.7349201Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2475 seconds precompiling for 13 choices 2025-12-04T09:14:59.7349339Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.7349417Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.7349482Z unimplemented [] 2025-12-04T09:14:59.7349593Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.7349779Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.7350996Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.7351058Z graph_break [] 2025-12-04T09:14:59.7351188Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.7351261Z Autotune Choices Stats: 2025-12-04T09:14:59.7352649Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_498", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.7352939Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7353160Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7353484Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7354609Z triton_flex_attention_498 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7355763Z triton_flex_attention_499 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7356881Z triton_flex_attention_494 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7358035Z triton_flex_attention_496 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.7359175Z triton_flex_attention_497 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.7360298Z triton_flex_attention_495 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7360559Z SingleProcess AUTOTUNE benchmarking takes 0.2916 seconds and 1.3152 seconds precompiling for 6 choices 2025-12-04T09:14:59.7360663Z Autotune Choices Stats: 2025-12-04T09:14:59.7362246Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_500", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:59.7362695Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7363031Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7363602Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7364842Z triton_flex_attention_backward_500 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7366056Z triton_flex_attention_backward_501 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7367222Z triton_flex_attention_backward_502 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7368418Z triton_flex_attention_backward_503 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7369585Z triton_flex_attention_backward_505 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7370782Z triton_flex_attention_backward_507 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7371928Z triton_flex_attention_backward_506 0.0143 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7373163Z triton_flex_attention_backward_504 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7374344Z triton_flex_attention_backward_509 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7375533Z triton_flex_attention_backward_512 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.7375779Z SingleProcess AUTOTUNE benchmarking takes 0.6641 seconds and 2.2733 seconds precompiling for 13 choices 2025-12-04T09:14:59.7375920Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.7376021Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.7376086Z unimplemented [] 2025-12-04T09:14:59.7376196Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.7376382Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.7377565Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 27), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 8), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 5), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.7377626Z graph_break [] 2025-12-04T09:14:59.7377753Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.7377880Z Autotune Choices Stats: 2025-12-04T09:14:59.7379276Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_517", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.7379527Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7379747Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7380076Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7381213Z triton_flex_attention_517 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7382362Z triton_flex_attention_518 0.0084 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7383503Z triton_flex_attention_513 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7384649Z triton_flex_attention_515 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.7385762Z triton_flex_attention_516 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.7386881Z triton_flex_attention_514 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7387162Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3311 seconds precompiling for 6 choices 2025-12-04T09:14:59.7387227Z Autotune Choices Stats: 2025-12-04T09:14:59.7388655Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_519", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:59.7389104Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7389428Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7390025Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7391224Z triton_flex_attention_backward_519 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7392378Z triton_flex_attention_backward_520 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7393563Z triton_flex_attention_backward_521 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7394712Z triton_flex_attention_backward_522 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7395898Z triton_flex_attention_backward_524 0.0132 ms 93.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7397039Z triton_flex_attention_backward_523 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7398196Z triton_flex_attention_backward_525 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7399384Z triton_flex_attention_backward_526 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7400560Z triton_flex_attention_backward_528 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7401743Z triton_flex_attention_backward_531 0.0153 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.7401990Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.2909 seconds precompiling for 13 choices 2025-12-04T09:14:59.7402124Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.7402192Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.7402253Z unimplemented [] 2025-12-04T09:14:59.7402361Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.7402556Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.7403743Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.7403880Z graph_break [] 2025-12-04T09:14:59.7404008Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.7404080Z Autotune Choices Stats: 2025-12-04T09:14:59.7405473Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_536", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.7405735Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7405955Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7406276Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7407441Z triton_flex_attention_536 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7408597Z triton_flex_attention_537 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7409764Z triton_flex_attention_532 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7410884Z triton_flex_attention_534 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.7411998Z triton_flex_attention_535 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.7413207Z triton_flex_attention_533 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7413456Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.2928 seconds precompiling for 6 choices 2025-12-04T09:14:59.7413524Z Autotune Choices Stats: 2025-12-04T09:14:59.7414948Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_538", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:14:59.7415385Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7415745Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7416346Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7417507Z triton_flex_attention_backward_538 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7418701Z triton_flex_attention_backward_539 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7419864Z triton_flex_attention_backward_540 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7421015Z triton_flex_attention_backward_541 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7422203Z triton_flex_attention_backward_542 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7423347Z triton_flex_attention_backward_543 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7424534Z triton_flex_attention_backward_545 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7425721Z triton_flex_attention_backward_544 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7426866Z triton_flex_attention_backward_547 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7428047Z triton_flex_attention_backward_550 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.7428295Z SingleProcess AUTOTUNE benchmarking takes 0.6681 seconds and 2.3456 seconds precompiling for 13 choices 2025-12-04T09:14:59.7428431Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.7428500Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.7428564Z unimplemented [] 2025-12-04T09:14:59.7428674Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.7428887Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.7430076Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.7430135Z graph_break [] 2025-12-04T09:14:59.7430262Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.7430340Z Autotune Choices Stats: 2025-12-04T09:14:59.7431730Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_555", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:59.7431980Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7432196Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7432550Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7433710Z triton_flex_attention_555 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7434822Z triton_flex_attention_556 0.0112 ms 91.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7435966Z triton_flex_attention_553 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.7437096Z triton_flex_attention_554 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.7438236Z triton_flex_attention_551 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7439350Z triton_flex_attention_552 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7439593Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.3230 seconds precompiling for 6 choices 2025-12-04T09:14:59.7439665Z Autotune Choices Stats: 2025-12-04T09:14:59.7441125Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_559", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:59.7441614Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7441941Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7442505Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7443701Z triton_flex_attention_backward_559 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7444862Z triton_flex_attention_backward_560 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7446017Z triton_flex_attention_backward_558 0.0142 ms 93.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7447198Z triton_flex_attention_backward_557 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7448351Z triton_flex_attention_backward_562 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7449529Z triton_flex_attention_backward_561 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7450685Z triton_flex_attention_backward_563 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7451860Z triton_flex_attention_backward_564 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7453128Z triton_flex_attention_backward_569 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.7454286Z triton_flex_attention_backward_566 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7454563Z SingleProcess AUTOTUNE benchmarking takes 0.6659 seconds and 2.2862 seconds precompiling for 13 choices 2025-12-04T09:14:59.7454698Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.7454766Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.7454828Z unimplemented [] 2025-12-04T09:14:59.7454933Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.7455114Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.7456298Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.7456362Z graph_break [] 2025-12-04T09:14:59.7456489Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.7456570Z Autotune Choices Stats: 2025-12-04T09:14:59.7457989Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_575", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:59.7458273Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7458493Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7458810Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7459940Z triton_flex_attention_575 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7461092Z triton_flex_attention_574 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7462369Z triton_flex_attention_572 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.7463509Z triton_flex_attention_573 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.7464686Z triton_flex_attention_570 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7465806Z triton_flex_attention_571 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7466059Z SingleProcess AUTOTUNE benchmarking takes 0.2930 seconds and 1.2848 seconds precompiling for 6 choices 2025-12-04T09:14:59.7466133Z Autotune Choices Stats: 2025-12-04T09:14:59.7467608Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_579", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.014271999709308147, "best_triton_pos": 0} 2025-12-04T09:14:59.7468091Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7468415Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7468978Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7470201Z triton_flex_attention_backward_579 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7471361Z triton_flex_attention_backward_577 0.0143 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7472541Z triton_flex_attention_backward_578 0.0143 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7473693Z triton_flex_attention_backward_576 0.0145 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7474847Z triton_flex_attention_backward_580 0.0154 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7476041Z triton_flex_attention_backward_581 0.0154 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7477230Z triton_flex_attention_backward_583 0.0155 ms 92.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7478406Z triton_flex_attention_backward_582 0.0164 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7479571Z triton_flex_attention_backward_585 0.0164 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7480723Z triton_flex_attention_backward_588 0.0164 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.7481000Z SingleProcess AUTOTUNE benchmarking takes 0.6683 seconds and 2.2637 seconds precompiling for 13 choices 2025-12-04T09:14:59.7481134Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.7481202Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.7481265Z unimplemented [] 2025-12-04T09:14:59.7481373Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.7481557Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.7482737Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.7482807Z graph_break [] 2025-12-04T09:14:59.7482932Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.7483011Z Autotune Choices Stats: 2025-12-04T09:14:59.7484438Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.7484727Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7484947Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7485266Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7486421Z triton_flex_attention_593 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7487543Z triton_flex_attention_594 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7488651Z triton_flex_attention_589 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7489803Z triton_flex_attention_591 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.7490917Z triton_flex_attention_592 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.7492038Z triton_flex_attention_590 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7492375Z SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.3283 seconds precompiling for 6 choices 2025-12-04T09:14:59.7492486Z Autotune Choices Stats: 2025-12-04T09:14:59.7493916Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_595", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:14:59.7494355Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7494718Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7495289Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7496454Z triton_flex_attention_backward_595 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7497618Z triton_flex_attention_backward_596 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7498814Z triton_flex_attention_backward_597 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7499982Z triton_flex_attention_backward_598 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7501165Z triton_flex_attention_backward_600 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7502342Z triton_flex_attention_backward_604 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7503526Z triton_flex_attention_backward_599 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7504674Z triton_flex_attention_backward_601 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7505823Z triton_flex_attention_backward_602 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7507025Z triton_flex_attention_backward_607 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.7507269Z SingleProcess AUTOTUNE benchmarking takes 0.6683 seconds and 2.3815 seconds precompiling for 13 choices 2025-12-04T09:14:59.7507400Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.7507476Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.7507537Z unimplemented [] 2025-12-04T09:14:59.7507643Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.7507822Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.7508994Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.7509236Z graph_break [] 2025-12-04T09:14:59.7509368Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.7509476Z Autotune Choices Stats: 2025-12-04T09:14:59.7510865Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_612", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.7511113Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7511329Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7511652Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7512814Z triton_flex_attention_612 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7513926Z triton_flex_attention_613 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7515066Z triton_flex_attention_610 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.7516179Z triton_flex_attention_608 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7517289Z triton_flex_attention_611 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.7518441Z triton_flex_attention_609 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7518715Z SingleProcess AUTOTUNE benchmarking takes 0.2948 seconds and 1.3336 seconds precompiling for 6 choices 2025-12-04T09:14:59.7518787Z Autotune Choices Stats: 2025-12-04T09:14:59.7520220Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_614", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:14:59.7520705Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7521032Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7521604Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7522764Z triton_flex_attention_backward_614 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7523956Z triton_flex_attention_backward_615 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7525107Z triton_flex_attention_backward_616 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7526297Z triton_flex_attention_backward_617 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7527480Z triton_flex_attention_backward_619 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7528627Z triton_flex_attention_backward_620 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7529814Z triton_flex_attention_backward_621 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7530959Z triton_flex_attention_backward_618 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7532143Z triton_flex_attention_backward_623 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7533354Z triton_flex_attention_backward_626 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.7533604Z SingleProcess AUTOTUNE benchmarking takes 0.6671 seconds and 2.3094 seconds precompiling for 13 choices 2025-12-04T09:14:59.7533734Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.7533807Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.7533881Z unimplemented [] 2025-12-04T09:14:59.7533991Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.7534173Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.7535383Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.7535483Z graph_break [] 2025-12-04T09:14:59.7535609Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.7535682Z Autotune Choices Stats: 2025-12-04T09:14:59.7537068Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_631", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:59.7537407Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7537627Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7537944Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7539078Z triton_flex_attention_631 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7540234Z triton_flex_attention_632 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7541345Z triton_flex_attention_627 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7542464Z triton_flex_attention_629 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.7543603Z triton_flex_attention_630 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.7544758Z triton_flex_attention_628 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7545000Z SingleProcess AUTOTUNE benchmarking takes 0.2939 seconds and 1.3035 seconds precompiling for 6 choices 2025-12-04T09:14:59.7545073Z Autotune Choices Stats: 2025-12-04T09:14:59.7553011Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_634", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:59.7553528Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7553879Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7554464Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7555720Z triton_flex_attention_backward_634 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7556897Z triton_flex_attention_backward_635 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7558053Z triton_flex_attention_backward_636 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7559250Z triton_flex_attention_backward_633 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7560441Z triton_flex_attention_backward_638 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7561610Z triton_flex_attention_backward_640 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7562992Z triton_flex_attention_backward_637 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7564145Z triton_flex_attention_backward_639 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7565400Z triton_flex_attention_backward_642 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7566558Z triton_flex_attention_backward_645 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.7566813Z SingleProcess AUTOTUNE benchmarking takes 0.6697 seconds and 2.2461 seconds precompiling for 13 choices 2025-12-04T09:14:59.7566965Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.7567050Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.7567170Z unimplemented [] 2025-12-04T09:14:59.7567280Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.7567528Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.7568723Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.7568793Z graph_break [] 2025-12-04T09:14:59.7568928Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.7569006Z Autotune Choices Stats: 2025-12-04T09:14:59.7570464Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_650", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:59.7570723Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7570944Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7571263Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7572483Z triton_flex_attention_650 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7573635Z triton_flex_attention_651 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7574754Z triton_flex_attention_648 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.7575907Z triton_flex_attention_646 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7577054Z triton_flex_attention_649 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.7578173Z triton_flex_attention_647 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7578421Z SingleProcess AUTOTUNE benchmarking takes 0.2938 seconds and 1.3188 seconds precompiling for 6 choices 2025-12-04T09:14:59.7578526Z Autotune Choices Stats: 2025-12-04T09:14:59.7579957Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_654", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:59.7580401Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7580762Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7581329Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7582504Z triton_flex_attention_backward_654 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7583661Z triton_flex_attention_backward_652 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7584833Z triton_flex_attention_backward_653 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7586009Z triton_flex_attention_backward_655 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7587221Z triton_flex_attention_backward_657 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7588380Z triton_flex_attention_backward_659 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7589532Z triton_flex_attention_backward_658 0.0156 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7590707Z triton_flex_attention_backward_656 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7591862Z triton_flex_attention_backward_661 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7593041Z triton_flex_attention_backward_664 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.7593370Z SingleProcess AUTOTUNE benchmarking takes 0.6665 seconds and 2.3325 seconds precompiling for 13 choices 2025-12-04T09:14:59.7593508Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.7593588Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.7593652Z unimplemented [] 2025-12-04T09:14:59.7593758Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.7593953Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.7595126Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.7595193Z graph_break [] 2025-12-04T09:14:59.7595322Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.7595429Z Autotune Choices Stats: 2025-12-04T09:14:59.7596835Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_669", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:59.7597085Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7597315Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7597667Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7598804Z triton_flex_attention_669 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7599916Z triton_flex_attention_670 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7601035Z triton_flex_attention_667 0.0123 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.7602183Z triton_flex_attention_665 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7603329Z triton_flex_attention_668 0.0135 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.7604488Z triton_flex_attention_666 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7604740Z SingleProcess AUTOTUNE benchmarking takes 0.2949 seconds and 1.4095 seconds precompiling for 6 choices 2025-12-04T09:14:59.7604813Z Autotune Choices Stats: 2025-12-04T09:14:59.7606243Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_673", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012512000277638435, "best_triton_pos": 0} 2025-12-04T09:14:59.7606709Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7607035Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7607600Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7608768Z triton_flex_attention_backward_673 0.0125 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7609959Z triton_flex_attention_backward_672 0.0133 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7611142Z triton_flex_attention_backward_674 0.0133 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7612354Z triton_flex_attention_backward_671 0.0143 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7613536Z triton_flex_attention_backward_676 0.0143 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7614689Z triton_flex_attention_backward_675 0.0152 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7615880Z triton_flex_attention_backward_677 0.0154 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7617030Z triton_flex_attention_backward_678 0.0154 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7618185Z triton_flex_attention_backward_680 0.0164 ms 76.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7619363Z triton_flex_attention_backward_683 0.0164 ms 76.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.7619664Z SingleProcess AUTOTUNE benchmarking takes 0.6665 seconds and 2.3098 seconds precompiling for 13 choices 2025-12-04T09:14:59.7619796Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.7619874Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.7619938Z unimplemented [] 2025-12-04T09:14:59.7620042Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.7620229Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.7621453Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.7621527Z graph_break [] 2025-12-04T09:14:59.7621656Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.7621723Z Autotune Choices Stats: 2025-12-04T09:14:59.7623125Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_688", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:59.7623409Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7623631Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7623947Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7625081Z triton_flex_attention_688 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7626197Z triton_flex_attention_689 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7627354Z triton_flex_attention_686 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.7628502Z triton_flex_attention_684 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7629646Z triton_flex_attention_687 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.7630769Z triton_flex_attention_685 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7631012Z SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.3382 seconds precompiling for 6 choices 2025-12-04T09:14:59.7631084Z Autotune Choices Stats: 2025-12-04T09:14:59.7632514Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_693", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:59.7632987Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7633312Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7633881Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7635091Z triton_flex_attention_backward_693 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7636248Z triton_flex_attention_backward_692 0.0135 ms 98.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7637431Z triton_flex_attention_backward_691 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7638620Z triton_flex_attention_backward_695 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7639764Z triton_flex_attention_backward_690 0.0153 ms 86.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7640949Z triton_flex_attention_backward_696 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7642098Z triton_flex_attention_backward_697 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7643250Z triton_flex_attention_backward_699 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7644434Z triton_flex_attention_backward_694 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7645609Z triton_flex_attention_backward_702 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.7645856Z SingleProcess AUTOTUNE benchmarking takes 0.6670 seconds and 2.3646 seconds precompiling for 13 choices 2025-12-04T09:14:59.7645985Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.7646067Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.7646135Z unimplemented [] 2025-12-04T09:14:59.7646240Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.7646462Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.7647640Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.7647704Z graph_break [] 2025-12-04T09:14:59.7647831Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.7647898Z Autotune Choices Stats: 2025-12-04T09:14:59.7649288Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_707", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:59.7649564Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7649787Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7650101Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7651240Z triton_flex_attention_707 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7652438Z triton_flex_attention_708 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7653634Z triton_flex_attention_703 0.0134 ms 76.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7654760Z triton_flex_attention_705 0.0135 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.7655909Z triton_flex_attention_706 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.7657032Z triton_flex_attention_704 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7657312Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3640 seconds precompiling for 6 choices 2025-12-04T09:14:59.7657385Z Autotune Choices Stats: 2025-12-04T09:14:59.7658817Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_712", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:59.7659262Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7659588Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7660152Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7661349Z triton_flex_attention_backward_712 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7662735Z triton_flex_attention_backward_710 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7663968Z triton_flex_attention_backward_711 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7665125Z triton_flex_attention_backward_709 0.0153 ms 86.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7666274Z triton_flex_attention_backward_714 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7667486Z triton_flex_attention_backward_716 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7668640Z triton_flex_attention_backward_713 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7669843Z triton_flex_attention_backward_715 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7671038Z triton_flex_attention_backward_718 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7672190Z triton_flex_attention_backward_721 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.7672478Z SingleProcess AUTOTUNE benchmarking takes 0.6684 seconds and 2.3297 seconds precompiling for 13 choices 2025-12-04T09:14:59.7672612Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.7672687Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.7672750Z unimplemented [] 2025-12-04T09:14:59.7672853Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.7673043Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.7674216Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.7674318Z graph_break [] 2025-12-04T09:14:59.7674448Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.7674513Z Autotune Choices Stats: 2025-12-04T09:14:59.7675905Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_726", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.7676147Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7676374Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7676690Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7677856Z triton_flex_attention_726 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7678997Z triton_flex_attention_727 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7680127Z triton_flex_attention_724 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.7681275Z triton_flex_attention_722 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7682396Z triton_flex_attention_725 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.7683575Z triton_flex_attention_723 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7683820Z SingleProcess AUTOTUNE benchmarking takes 0.2946 seconds and 1.3261 seconds precompiling for 6 choices 2025-12-04T09:14:59.7683893Z Autotune Choices Stats: 2025-12-04T09:14:59.7685329Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_730", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:59.7685769Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7686095Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7686684Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7687892Z triton_flex_attention_backward_730 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7689083Z triton_flex_attention_backward_729 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7690235Z triton_flex_attention_backward_731 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7691386Z triton_flex_attention_backward_728 0.0152 ms 87.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7692620Z triton_flex_attention_backward_733 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7693781Z triton_flex_attention_backward_732 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7694935Z triton_flex_attention_backward_734 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7696127Z triton_flex_attention_backward_735 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7697320Z triton_flex_attention_backward_737 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7698507Z triton_flex_attention_backward_736 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7698761Z SingleProcess AUTOTUNE benchmarking takes 0.6686 seconds and 2.3200 seconds precompiling for 13 choices 2025-12-04T09:14:59.7698889Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.7698963Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.7699030Z unimplemented [] 2025-12-04T09:14:59.7699131Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.7699320Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.7700503Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.7700603Z graph_break [] 2025-12-04T09:14:59.7700730Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.7700795Z Autotune Choices Stats: 2025-12-04T09:14:59.7702183Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_745", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.7702425Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7702648Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7702962Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7704127Z triton_flex_attention_745 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7705274Z triton_flex_attention_746 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7706440Z triton_flex_attention_743 0.0142 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.7707551Z triton_flex_attention_741 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7708667Z triton_flex_attention_744 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.7709816Z triton_flex_attention_742 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7710058Z SingleProcess AUTOTUNE benchmarking takes 0.2945 seconds and 1.2852 seconds precompiling for 6 choices 2025-12-04T09:14:59.7710131Z Autotune Choices Stats: 2025-12-04T09:14:59.7711552Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_748", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:59.7712015Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7712371Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7712930Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7714097Z triton_flex_attention_backward_748 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7715316Z triton_flex_attention_backward_749 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7716461Z triton_flex_attention_backward_747 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7717641Z triton_flex_attention_backward_750 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7718789Z triton_flex_attention_backward_752 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7719942Z triton_flex_attention_backward_751 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7721123Z triton_flex_attention_backward_753 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7722302Z triton_flex_attention_backward_754 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7723481Z triton_flex_attention_backward_756 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7724630Z triton_flex_attention_backward_759 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.7724879Z SingleProcess AUTOTUNE benchmarking takes 0.6747 seconds and 2.3463 seconds precompiling for 13 choices 2025-12-04T09:14:59.7725007Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.7725112Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.7725176Z unimplemented [] 2025-12-04T09:14:59.7725277Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.7725463Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.7726643Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.7726708Z graph_break [] 2025-12-04T09:14:59.7726840Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.7726907Z Autotune Choices Stats: 2025-12-04T09:14:59.7728310Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_764", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:59.7728550Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7728854Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7729207Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7730362Z triton_flex_attention_764 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7731511Z triton_flex_attention_765 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7732685Z triton_flex_attention_762 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.7733815Z triton_flex_attention_760 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7734969Z triton_flex_attention_763 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.7736095Z triton_flex_attention_761 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7736341Z SingleProcess AUTOTUNE benchmarking takes 0.2940 seconds and 1.3826 seconds precompiling for 6 choices 2025-12-04T09:14:59.7736407Z Autotune Choices Stats: 2025-12-04T09:14:59.7737874Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_769", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013376000337302685, "best_triton_pos": 0} 2025-12-04T09:14:59.7738347Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7738675Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7739236Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7740443Z triton_flex_attention_backward_769 0.0134 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7741598Z triton_flex_attention_backward_767 0.0143 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7742748Z triton_flex_attention_backward_768 0.0143 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7743932Z triton_flex_attention_backward_766 0.0154 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7745077Z triton_flex_attention_backward_770 0.0154 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7746265Z triton_flex_attention_backward_771 0.0154 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7747475Z triton_flex_attention_backward_772 0.0154 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7748621Z triton_flex_attention_backward_773 0.0154 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7749812Z triton_flex_attention_backward_775 0.0164 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7750961Z triton_flex_attention_backward_778 0.0164 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.7751245Z SingleProcess AUTOTUNE benchmarking takes 0.6673 seconds and 2.4135 seconds precompiling for 13 choices 2025-12-04T09:14:59.7751418Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:14:59.7751505Z Traceback (most recent call last): 2025-12-04T09:14:59.7751809Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:14:59.7751888Z self.assertTrue( 2025-12-04T09:14:59.7752094Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:14:59.7752177Z raise self.failureException(msg) 2025-12-04T09:14:59.7752418Z AssertionError: False is not true : Log file /tmp/tmphg47bmc3/flex_attention_configs.json was not created 2025-12-04T09:14:59.7752426Z 2025-12-04T09:14:59.7752568Z To execute this test, run the following from the base repo dir: 2025-12-04T09:14:59.7752821Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:14:59.7752826Z 2025-12-04T09:14:59.7752999Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:14:59.7753132Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.7753201Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.7753270Z unimplemented [] 2025-12-04T09:14:59.7753376Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.7754610Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('async_compile_cache_miss', 23), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:14:59.7754828Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.7754889Z graph_break [] 2025-12-04T09:14:59.7755024Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.7756008Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:14:59.7756098Z current_size = base.storage().size() 2025-12-04T09:14:59.7756167Z Autotune Choices Stats: 2025-12-04T09:14:59.7757608Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:59.7757861Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7758081Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7758403Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7759560Z triton_flex_attention_4 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7760673Z triton_flex_attention_5 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7761940Z triton_flex_attention_0 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7763139Z triton_flex_attention_2 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.7764308Z triton_flex_attention_3 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.7765416Z triton_flex_attention_1 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7765711Z SingleProcess AUTOTUNE benchmarking takes 0.2600 seconds and 1.4244 seconds precompiling for 6 choices 2025-12-04T09:14:59.7765780Z Autotune Choices Stats: 2025-12-04T09:14:59.7767220Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:59.7767655Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7768035Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7768589Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7769757Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7770904Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7772098Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7773352Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7774537Z triton_flex_attention_backward_11 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7775689Z triton_flex_attention_backward_12 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7776833Z triton_flex_attention_backward_13 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7778007Z triton_flex_attention_backward_10 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7779155Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7780356Z triton_flex_attention_backward_15 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7780641Z SingleProcess AUTOTUNE benchmarking takes 0.8717 seconds and 2.2315 seconds precompiling for 13 choices 2025-12-04T09:14:59.7780773Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.7780844Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.7780912Z unimplemented [] 2025-12-04T09:14:59.7781014Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.7781198Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.7782377Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.7782472Z graph_break [] 2025-12-04T09:14:59.7782618Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.7782688Z Autotune Choices Stats: 2025-12-04T09:14:59.7784077Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.7784319Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7784576Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7784893Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7786018Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7787136Z triton_flex_attention_24 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7788280Z triton_flex_attention_19 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7789426Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.7790542Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.7791684Z triton_flex_attention_20 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7791932Z SingleProcess AUTOTUNE benchmarking takes 0.2898 seconds and 1.2778 seconds precompiling for 6 choices 2025-12-04T09:14:59.7791999Z Autotune Choices Stats: 2025-12-04T09:14:59.7793427Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.7793890Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7794223Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7794783Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7795960Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7797140Z triton_flex_attention_backward_27 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7798341Z triton_flex_attention_backward_28 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7799526Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7800679Z triton_flex_attention_backward_30 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7801834Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7803004Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7804156Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7805332Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7806488Z triton_flex_attention_backward_37 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.7806767Z SingleProcess AUTOTUNE benchmarking takes 0.6605 seconds and 2.3199 seconds precompiling for 13 choices 2025-12-04T09:14:59.7806902Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.7806969Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.7807038Z unimplemented [] 2025-12-04T09:14:59.7807139Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.7807322Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.7808539Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.7808603Z graph_break [] 2025-12-04T09:14:59.7808738Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.7808805Z Autotune Choices Stats: 2025-12-04T09:14:59.7810184Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.7810484Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7810696Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7811014Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7812160Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7813318Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7814469Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.7815616Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.7816756Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7817868Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7818114Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2616 seconds precompiling for 6 choices 2025-12-04T09:14:59.7818213Z Autotune Choices Stats: 2025-12-04T09:14:59.7819644Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.7820073Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7820408Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7820965Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7822158Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7823337Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7824493Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7825667Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7826826Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7828004Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7829146Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7830303Z triton_flex_attention_backward_50 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7831483Z triton_flex_attention_backward_53 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7832675Z triton_flex_attention_backward_56 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.7832917Z SingleProcess AUTOTUNE benchmarking takes 0.6631 seconds and 2.2635 seconds precompiling for 13 choices 2025-12-04T09:14:59.7833050Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.7833119Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.7833187Z unimplemented [] 2025-12-04T09:14:59.7833319Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.7833505Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.7834680Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.7834741Z graph_break [] 2025-12-04T09:14:59.7834870Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.7834972Z Autotune Choices Stats: 2025-12-04T09:14:59.7836356Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.7836600Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7836819Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7837139Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7838264Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7839423Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7840574Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.7841730Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.7842843Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7843955Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7844233Z SingleProcess AUTOTUNE benchmarking takes 0.2907 seconds and 1.2500 seconds precompiling for 6 choices 2025-12-04T09:14:59.7844299Z Autotune Choices Stats: 2025-12-04T09:14:59.7845722Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.7846154Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7846484Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7847042Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7848259Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7849440Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7850633Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7851781Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7852986Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7854166Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7855317Z triton_flex_attention_backward_69 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7856493Z triton_flex_attention_backward_70 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7857668Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7858810Z triton_flex_attention_backward_75 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.7859084Z SingleProcess AUTOTUNE benchmarking takes 0.6601 seconds and 2.1469 seconds precompiling for 13 choices 2025-12-04T09:14:59.7859220Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.7859288Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.7859357Z unimplemented [] 2025-12-04T09:14:59.7859459Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.7859640Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.7860830Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.7860967Z graph_break [] 2025-12-04T09:14:59.7861101Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.7861166Z Autotune Choices Stats: 2025-12-04T09:14:59.7862691Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.7862949Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7863168Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7863486Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7864683Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7865861Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7866980Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.7868146Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.7869272Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7870426Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7870677Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2576 seconds precompiling for 6 choices 2025-12-04T09:14:59.7870741Z Autotune Choices Stats: 2025-12-04T09:14:59.7872187Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_84", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.7872623Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7872989Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7873595Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7874768Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7875947Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7877094Z triton_flex_attention_backward_83 0.0122 ms 92.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7878244Z triton_flex_attention_backward_82 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7879425Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7880571Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7881747Z triton_flex_attention_backward_88 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7882935Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7884091Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7885277Z triton_flex_attention_backward_91 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7885526Z SingleProcess AUTOTUNE benchmarking takes 0.6599 seconds and 2.2799 seconds precompiling for 13 choices 2025-12-04T09:14:59.7885662Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.7885729Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.7885791Z unimplemented [] 2025-12-04T09:14:59.7885898Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.7886080Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.7887289Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.7887350Z graph_break [] 2025-12-04T09:14:59.7887480Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.7887544Z Autotune Choices Stats: 2025-12-04T09:14:59.7888939Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:59.7889187Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7889409Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7889761Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7890886Z triton_flex_attention_99 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7892033Z triton_flex_attention_100 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7893241Z triton_flex_attention_95 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7894362Z triton_flex_attention_98 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.7895472Z triton_flex_attention_97 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.7896612Z triton_flex_attention_96 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7896862Z SingleProcess AUTOTUNE benchmarking takes 0.2905 seconds and 1.2994 seconds precompiling for 6 choices 2025-12-04T09:14:59.7896930Z Autotune Choices Stats: 2025-12-04T09:14:59.7898357Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_102", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.7898821Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7899182Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7899752Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7900920Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7902107Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7903259Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7904438Z triton_flex_attention_backward_101 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7905591Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7906747Z triton_flex_attention_backward_107 0.0133 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7907951Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7909131Z triton_flex_attention_backward_108 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7910305Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7911461Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7911705Z SingleProcess AUTOTUNE benchmarking takes 0.6615 seconds and 2.2434 seconds precompiling for 13 choices 2025-12-04T09:14:59.7911875Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.7911945Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.7912008Z unimplemented [] 2025-12-04T09:14:59.7912115Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.7912301Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.7913476Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.7913540Z graph_break [] 2025-12-04T09:14:59.7913667Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.7913740Z Autotune Choices Stats: 2025-12-04T09:14:59.7915127Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.7915408Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7915653Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7915975Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7917118Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7918269Z triton_flex_attention_119 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7919395Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.7920518Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.7921666Z triton_flex_attention_114 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7922778Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7923030Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2587 seconds precompiling for 6 choices 2025-12-04T09:14:59.7923096Z Autotune Choices Stats: 2025-12-04T09:14:59.7924555Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:59.7925022Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7925349Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7925921Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7927118Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7928275Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7929474Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7930624Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7931787Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7933019Z triton_flex_attention_backward_124 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7934208Z triton_flex_attention_backward_126 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7935389Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7936535Z triton_flex_attention_backward_129 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7937688Z triton_flex_attention_backward_132 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.7937980Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.3330 seconds precompiling for 13 choices 2025-12-04T09:14:59.7938112Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.7938181Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.7938241Z unimplemented [] 2025-12-04T09:14:59.7938350Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.7938529Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.7939715Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.7939777Z graph_break [] 2025-12-04T09:14:59.7939901Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.7939971Z Autotune Choices Stats: 2025-12-04T09:14:59.7941389Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:59.7941668Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7941882Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7942200Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7943329Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7944479Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7945603Z triton_flex_attention_136 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.7946756Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.7947865Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7948985Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7949229Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3488 seconds precompiling for 6 choices 2025-12-04T09:14:59.7949335Z Autotune Choices Stats: 2025-12-04T09:14:59.7950791Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_140", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.7951228Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7951553Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7952152Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7953309Z triton_flex_attention_backward_140 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7954469Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7955659Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7956804Z triton_flex_attention_backward_142 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7958255Z triton_flex_attention_backward_144 0.0131 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7959448Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7960600Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7961941Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7963128Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7964485Z triton_flex_attention_backward_149 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7964732Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.2392 seconds precompiling for 13 choices 2025-12-04T09:14:59.7964868Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.7964936Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.7964999Z unimplemented [] 2025-12-04T09:14:59.7965107Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.7965305Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.7966485Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.7966544Z graph_break [] 2025-12-04T09:14:59.7966671Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.7966797Z Autotune Choices Stats: 2025-12-04T09:14:59.7968238Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.7968491Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7968709Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7969031Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7970235Z triton_flex_attention_156 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7971356Z triton_flex_attention_157 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7972541Z triton_flex_attention_152 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7973662Z triton_flex_attention_154 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.7974781Z triton_flex_attention_155 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.7975945Z triton_flex_attention_153 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7976234Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3309 seconds precompiling for 6 choices 2025-12-04T09:14:59.7976302Z Autotune Choices Stats: 2025-12-04T09:14:59.7977734Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_160", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.7978171Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7978533Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7979095Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7980252Z triton_flex_attention_backward_160 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7981441Z triton_flex_attention_backward_158 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7982597Z triton_flex_attention_backward_159 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7983748Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7984930Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7986109Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.7987289Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.7988439Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.7989599Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7990779Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.7991028Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.1546 seconds precompiling for 13 choices 2025-12-04T09:14:59.7991164Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.7991235Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.7991297Z unimplemented [] 2025-12-04T09:14:59.7991405Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.7991588Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.7992797Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.7992889Z graph_break [] 2025-12-04T09:14:59.7993020Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.7993096Z Autotune Choices Stats: 2025-12-04T09:14:59.7994482Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.7994732Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.7995024Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.7995346Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.7996478Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7997598Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7998739Z triton_flex_attention_171 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.7999865Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.8001004Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.8002168Z triton_flex_attention_172 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8002414Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3199 seconds precompiling for 6 choices 2025-12-04T09:14:59.8002480Z Autotune Choices Stats: 2025-12-04T09:14:59.8003935Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_178", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.8004373Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.8004692Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.8005263Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.8006461Z triton_flex_attention_backward_178 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8007620Z triton_flex_attention_backward_177 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8008771Z triton_flex_attention_backward_180 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.8009957Z triton_flex_attention_backward_179 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.8011138Z triton_flex_attention_backward_182 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8012414Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8013592Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.8014748Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.8015936Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8017094Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.8017338Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3449 seconds precompiling for 13 choices 2025-12-04T09:14:59.8017471Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.8017538Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.8017601Z unimplemented [] 2025-12-04T09:14:59.8017710Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.8017925Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.8019138Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.8019201Z graph_break [] 2025-12-04T09:14:59.8019338Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.8019412Z Autotune Choices Stats: 2025-12-04T09:14:59.8020837Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:59.8021087Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.8021302Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.8021621Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.8022766Z triton_flex_attention_194 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8023917Z triton_flex_attention_195 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8025036Z triton_flex_attention_192 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.8026152Z triton_flex_attention_193 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.8027299Z triton_flex_attention_190 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8028457Z triton_flex_attention_191 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8028701Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3094 seconds precompiling for 6 choices 2025-12-04T09:14:59.8028774Z Autotune Choices Stats: 2025-12-04T09:14:59.8030234Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.8030672Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.8030997Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.8031593Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.8032758Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8033932Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.8035134Z triton_flex_attention_backward_197 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8036312Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.8037470Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8038648Z triton_flex_attention_backward_200 0.0124 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8039804Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.8040982Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.8042138Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.8043292Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8043562Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.3930 seconds precompiling for 13 choices 2025-12-04T09:14:59.8043725Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.8043794Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.8043855Z unimplemented [] 2025-12-04T09:14:59.8043964Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.8044146Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.8045315Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.8045386Z graph_break [] 2025-12-04T09:14:59.8045521Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.8045600Z Autotune Choices Stats: 2025-12-04T09:14:59.8047022Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008063999935984612, "best_triton_pos": 0} 2025-12-04T09:14:59.8047270Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.8047487Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.8047839Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.8048967Z triton_flex_attention_213 0.0081 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8050078Z triton_flex_attention_214 0.0082 ms 98.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8051203Z triton_flex_attention_211 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.8052403Z triton_flex_attention_212 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.8053556Z triton_flex_attention_209 0.0112 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8054710Z triton_flex_attention_210 0.0133 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8054956Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.3148 seconds precompiling for 6 choices 2025-12-04T09:14:59.8055033Z Autotune Choices Stats: 2025-12-04T09:14:59.8056457Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.8056929Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.8057255Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.8057818Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.8058982Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8060181Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8061361Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.8062689Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.8063913Z triton_flex_attention_backward_219 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8065075Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8066306Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.8067459Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.8068615Z triton_flex_attention_backward_224 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8069817Z triton_flex_attention_backward_227 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.8070118Z SingleProcess AUTOTUNE benchmarking takes 0.6619 seconds and 2.3746 seconds precompiling for 13 choices 2025-12-04T09:14:59.8070256Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.8070325Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.8070388Z unimplemented [] 2025-12-04T09:14:59.8070494Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.8070674Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.8071889Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.8071962Z graph_break [] 2025-12-04T09:14:59.8072089Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.8072161Z Autotune Choices Stats: 2025-12-04T09:14:59.8073550Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.8073834Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.8074049Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.8074365Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.8075494Z triton_flex_attention_232 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8076617Z triton_flex_attention_233 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8077763Z triton_flex_attention_228 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8078914Z triton_flex_attention_230 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.8080036Z triton_flex_attention_231 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.8081184Z triton_flex_attention_229 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8081426Z SingleProcess AUTOTUNE benchmarking takes 0.2900 seconds and 1.2906 seconds precompiling for 6 choices 2025-12-04T09:14:59.8081496Z Autotune Choices Stats: 2025-12-04T09:14:59.8082927Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:59.8083396Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.8083720Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.8084286Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.8085448Z triton_flex_attention_backward_234 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8086654Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8087836Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.8089023Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.8090195Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8091343Z triton_flex_attention_backward_240 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.8092580Z triton_flex_attention_backward_241 0.0134 ms 91.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.8093734Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8094938Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8096140Z triton_flex_attention_backward_246 0.0153 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.8096385Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2768 seconds precompiling for 13 choices 2025-12-04T09:14:59.8096513Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.8096590Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.8096655Z unimplemented [] 2025-12-04T09:14:59.8096763Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.8096949Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.8098179Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.8098247Z graph_break [] 2025-12-04T09:14:59.8098373Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.8098446Z Autotune Choices Stats: 2025-12-04T09:14:59.8099843Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.8100127Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.8100346Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.8100661Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.8101796Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8102946Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8104094Z triton_flex_attention_250 0.0111 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.8105209Z triton_flex_attention_247 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8106359Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.8107480Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8107756Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2952 seconds precompiling for 6 choices 2025-12-04T09:14:59.8107830Z Autotune Choices Stats: 2025-12-04T09:14:59.8109248Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_255", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.8109685Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.8110012Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.8110585Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.8111786Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.8112980Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8114157Z triton_flex_attention_backward_254 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8115326Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.8116481Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8117668Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8118823Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.8119981Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.8121169Z triton_flex_attention_backward_262 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8122360Z triton_flex_attention_backward_265 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.8122609Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.3139 seconds precompiling for 13 choices 2025-12-04T09:14:59.8122770Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.8122848Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.8122911Z unimplemented [] 2025-12-04T09:14:59.8123022Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.8123204Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.8124372Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.8124440Z graph_break [] 2025-12-04T09:14:59.8124609Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.8124681Z Autotune Choices Stats: 2025-12-04T09:14:59.8126068Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.8126313Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.8126531Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.8126847Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.8127976Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8129132Z triton_flex_attention_271 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8130337Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8131588Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.8132747Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.8133873Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8134150Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T09:14:59.8134229Z Autotune Choices Stats: 2025-12-04T09:14:59.8135659Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:59.8136093Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.8136414Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.8137005Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.8138194Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8139351Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8140529Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.8141689Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.8142879Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8144026Z triton_flex_attention_backward_278 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.8145175Z triton_flex_attention_backward_279 0.0136 ms 90.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.8146350Z triton_flex_attention_backward_276 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8147529Z triton_flex_attention_backward_281 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8148715Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.8148958Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.2940 seconds precompiling for 13 choices 2025-12-04T09:14:59.8149091Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.8149168Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.8149230Z unimplemented [] 2025-12-04T09:14:59.8149337Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.8149520Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.8150696Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.8150797Z graph_break [] 2025-12-04T09:14:59.8150923Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.8150996Z Autotune Choices Stats: 2025-12-04T09:14:59.8152394Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.8152644Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.8152858Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.8153173Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.8154335Z triton_flex_attention_289 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8155486Z triton_flex_attention_290 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8156635Z triton_flex_attention_287 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.8157747Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8158857Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.8160019Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8160260Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3003 seconds precompiling for 6 choices 2025-12-04T09:14:59.8160336Z Autotune Choices Stats: 2025-12-04T09:14:59.8161755Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_294", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.012160000391304493, "best_triton_pos": 0} 2025-12-04T09:14:59.8162359Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.8162765Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.8163375Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.8164541Z triton_flex_attention_backward_294 0.0122 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.8165738Z triton_flex_attention_backward_291 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8166890Z triton_flex_attention_backward_292 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8168048Z triton_flex_attention_backward_293 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.8169245Z triton_flex_attention_backward_296 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8170397Z triton_flex_attention_backward_295 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8171585Z triton_flex_attention_backward_297 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.8172821Z triton_flex_attention_backward_298 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.8173990Z triton_flex_attention_backward_300 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8175187Z triton_flex_attention_backward_303 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.8175431Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.2523 seconds precompiling for 13 choices 2025-12-04T09:14:59.8175562Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.8175640Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.8175701Z unimplemented [] 2025-12-04T09:14:59.8175839Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.8176031Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.8177203Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.8177271Z graph_break [] 2025-12-04T09:14:59.8177400Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.8177473Z Autotune Choices Stats: 2025-12-04T09:14:59.8178864Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_308", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.8179118Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.8179376Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.8179720Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.8180854Z triton_flex_attention_308 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8181968Z triton_flex_attention_309 0.0082 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8183122Z triton_flex_attention_307 0.0112 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.8184237Z triton_flex_attention_304 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8185383Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.8186499Z triton_flex_attention_305 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8186746Z SingleProcess AUTOTUNE benchmarking takes 0.2924 seconds and 1.4096 seconds precompiling for 6 choices 2025-12-04T09:14:59.8186816Z Autotune Choices Stats: 2025-12-04T09:14:59.8188287Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.8188756Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.8189086Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.8189647Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.8190840Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8192001Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8193157Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.8194361Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.8195528Z triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8196705Z triton_flex_attention_backward_316 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.8197902Z triton_flex_attention_backward_317 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.8199047Z triton_flex_attention_backward_314 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8200237Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8201391Z triton_flex_attention_backward_318 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8201673Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.3225 seconds precompiling for 13 choices 2025-12-04T09:14:59.8201801Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.8201875Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.8201938Z unimplemented [] 2025-12-04T09:14:59.8202037Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.8202236Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.8203418Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.8203489Z graph_break [] 2025-12-04T09:14:59.8203616Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.8203690Z Autotune Choices Stats: 2025-12-04T09:14:59.8205109Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.007968000136315823, "best_triton_pos": 0} 2025-12-04T09:14:59.8205386Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.8205605Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.8205918Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.8207050Z triton_flex_attention_327 0.0080 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8208195Z triton_flex_attention_328 0.0082 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8209328Z triton_flex_attention_325 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.8210450Z triton_flex_attention_326 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.8211608Z triton_flex_attention_323 0.0113 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8212772Z triton_flex_attention_324 0.0133 ms 59.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8213015Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T09:14:59.8213087Z Autotune Choices Stats: 2025-12-04T09:14:59.8214547Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.8215014Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.8215339Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.8215903Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.8217105Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8218268Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8219462Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.8220634Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.8221785Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8222967Z triton_flex_attention_backward_336 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.8224147Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8225338Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.8226498Z triton_flex_attention_backward_338 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8227646Z triton_flex_attention_backward_341 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.8227929Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3668 seconds precompiling for 13 choices 2025-12-04T09:14:59.8228058Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.8228132Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.8228195Z unimplemented [] 2025-12-04T09:14:59.8228295Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.8228485Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.8229668Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.8229734Z graph_break [] 2025-12-04T09:14:59.8229859Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.8229924Z Autotune Choices Stats: 2025-12-04T09:14:59.8231346Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.8231622Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.8231835Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.8232159Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.8233328Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8234438Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8235553Z triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8236703Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.8237814Z triton_flex_attention_345 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.8238964Z triton_flex_attention_343 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8239233Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2820 seconds precompiling for 6 choices 2025-12-04T09:14:59.8239307Z Autotune Choices Stats: 2025-12-04T09:14:59.8240752Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_349", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.8241185Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.8241544Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.8242105Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.8243270Z triton_flex_attention_backward_349 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8244462Z triton_flex_attention_backward_348 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8245609Z triton_flex_attention_backward_350 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.8246776Z triton_flex_attention_backward_351 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.8247954Z triton_flex_attention_backward_353 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8249153Z triton_flex_attention_backward_355 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.8250345Z triton_flex_attention_backward_352 0.0143 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8251494Z triton_flex_attention_backward_354 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.8252701Z triton_flex_attention_backward_357 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8253882Z triton_flex_attention_backward_360 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.8254131Z SingleProcess AUTOTUNE benchmarking takes 0.6637 seconds and 2.2959 seconds precompiling for 13 choices 2025-12-04T09:14:59.8254259Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.8254334Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.8254400Z unimplemented [] 2025-12-04T09:14:59.8254499Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.8254686Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.8255899Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.8255974Z graph_break [] 2025-12-04T09:14:59.8256150Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.8256218Z Autotune Choices Stats: 2025-12-04T09:14:59.8257607Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.8257845Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.8258072Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.8258421Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.8259554Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8260663Z triton_flex_attention_366 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8261966Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.8263115Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8264246Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.8265464Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8265759Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3270 seconds precompiling for 6 choices 2025-12-04T09:14:59.8265832Z Autotune Choices Stats: 2025-12-04T09:14:59.8267257Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_368", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.8267736Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.8268061Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.8268624Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.8269803Z triton_flex_attention_backward_368 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8271006Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8272155Z triton_flex_attention_backward_369 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.8273338Z triton_flex_attention_backward_370 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.8274512Z triton_flex_attention_backward_372 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8275682Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.8276874Z triton_flex_attention_backward_374 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.8278019Z triton_flex_attention_backward_371 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8279204Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8280354Z triton_flex_attention_backward_379 0.0146 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.8280603Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.4637 seconds precompiling for 13 choices 2025-12-04T09:14:59.8280737Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.8280810Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.8280872Z unimplemented [] 2025-12-04T09:14:59.8280973Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.8281170Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.8282374Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.8282474Z graph_break [] 2025-12-04T09:14:59.8282600Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.8282667Z Autotune Choices Stats: 2025-12-04T09:14:59.8284058Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_384", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.8284334Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.8284555Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.8284869Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.8286000Z triton_flex_attention_384 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8287162Z triton_flex_attention_385 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8288285Z triton_flex_attention_380 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8289422Z triton_flex_attention_382 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.8290587Z triton_flex_attention_383 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.8293214Z triton_flex_attention_381 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8293457Z SingleProcess AUTOTUNE benchmarking takes 0.2918 seconds and 1.3242 seconds precompiling for 6 choices 2025-12-04T09:14:59.8293530Z Autotune Choices Stats: 2025-12-04T09:14:59.8295002Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_386", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:59.8295444Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.8295768Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.8296370Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.8297531Z triton_flex_attention_backward_386 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8298693Z triton_flex_attention_backward_387 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8299859Z triton_flex_attention_backward_388 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.8301051Z triton_flex_attention_backward_389 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.8302239Z triton_flex_attention_backward_391 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8303431Z triton_flex_attention_backward_392 0.0134 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.8304603Z triton_flex_attention_backward_390 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8305749Z triton_flex_attention_backward_393 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.8306937Z triton_flex_attention_backward_395 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8308089Z triton_flex_attention_backward_394 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8308339Z SingleProcess AUTOTUNE benchmarking takes 0.6645 seconds and 2.2524 seconds precompiling for 13 choices 2025-12-04T09:14:59.8308474Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.8308585Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.8308652Z unimplemented [] 2025-12-04T09:14:59.8308790Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.8308977Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.8310150Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.8310216Z graph_break [] 2025-12-04T09:14:59.8310342Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.8310412Z Autotune Choices Stats: 2025-12-04T09:14:59.8311832Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_403", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.8312089Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.8312304Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.8312623Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.8313745Z triton_flex_attention_403 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8314894Z triton_flex_attention_404 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8316009Z triton_flex_attention_399 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8317165Z triton_flex_attention_401 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.8318308Z triton_flex_attention_402 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.8319435Z triton_flex_attention_400 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8319710Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3374 seconds precompiling for 6 choices 2025-12-04T09:14:59.8319785Z Autotune Choices Stats: 2025-12-04T09:14:59.8321208Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_406", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.8321642Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.8322018Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.8322577Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.8323742Z triton_flex_attention_backward_406 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8324903Z triton_flex_attention_backward_405 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8326085Z triton_flex_attention_backward_407 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.8327271Z triton_flex_attention_backward_408 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.8328461Z triton_flex_attention_backward_410 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8329606Z triton_flex_attention_backward_409 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8330769Z triton_flex_attention_backward_411 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.8331953Z triton_flex_attention_backward_412 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.8333197Z triton_flex_attention_backward_413 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8334392Z triton_flex_attention_backward_414 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8334668Z SingleProcess AUTOTUNE benchmarking takes 0.6644 seconds and 2.4031 seconds precompiling for 13 choices 2025-12-04T09:14:59.8334799Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.8334876Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.8334938Z unimplemented [] 2025-12-04T09:14:59.8335044Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.8335227Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.8336401Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.8336470Z graph_break [] 2025-12-04T09:14:59.8336632Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.8336709Z Autotune Choices Stats: 2025-12-04T09:14:59.8338101Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_422", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.8338347Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.8338600Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.8338914Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.8340056Z triton_flex_attention_422 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8341174Z triton_flex_attention_423 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8342322Z triton_flex_attention_420 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.8343442Z triton_flex_attention_421 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.8344584Z triton_flex_attention_418 0.0123 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8345736Z triton_flex_attention_419 0.0142 ms 57.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8345984Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3073 seconds precompiling for 6 choices 2025-12-04T09:14:59.8346058Z Autotune Choices Stats: 2025-12-04T09:14:59.8347485Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_426", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012256000190973282, "best_triton_pos": 0} 2025-12-04T09:14:59.8347974Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.8348297Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.8348862Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.8350043Z triton_flex_attention_backward_426 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.8351238Z triton_flex_attention_backward_424 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8352422Z triton_flex_attention_backward_425 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8353577Z triton_flex_attention_backward_427 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.8354776Z triton_flex_attention_backward_429 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8355932Z triton_flex_attention_backward_431 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.8357120Z triton_flex_attention_backward_430 0.0141 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.8358268Z triton_flex_attention_backward_428 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8359439Z triton_flex_attention_backward_433 0.0143 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8360624Z triton_flex_attention_backward_436 0.0144 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.8360907Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.3392 seconds precompiling for 13 choices 2025-12-04T09:14:59.8361035Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.8361108Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.8361173Z unimplemented [] 2025-12-04T09:14:59.8361282Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.8361466Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.8362879Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.8362957Z graph_break [] 2025-12-04T09:14:59.8363088Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.8363162Z Autotune Choices Stats: 2025-12-04T09:14:59.8364556Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_441", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.8364856Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.8365074Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.8365387Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.8366525Z triton_flex_attention_441 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8367644Z triton_flex_attention_442 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8368826Z triton_flex_attention_439 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.8369999Z triton_flex_attention_437 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8371140Z triton_flex_attention_440 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.8372265Z triton_flex_attention_438 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8372564Z SingleProcess AUTOTUNE benchmarking takes 0.2906 seconds and 1.4377 seconds precompiling for 6 choices 2025-12-04T09:14:59.8372642Z Autotune Choices Stats: 2025-12-04T09:14:59.8374067Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_445", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.8374541Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.8374875Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.8375448Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.8376645Z triton_flex_attention_backward_445 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.8377836Z triton_flex_attention_backward_446 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.8378983Z triton_flex_attention_backward_443 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8380173Z triton_flex_attention_backward_444 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8381329Z triton_flex_attention_backward_447 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8382509Z triton_flex_attention_backward_448 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8383660Z triton_flex_attention_backward_450 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.8384808Z triton_flex_attention_backward_449 0.0135 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.8385991Z triton_flex_attention_backward_452 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8387193Z triton_flex_attention_backward_455 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.8387438Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.3278 seconds precompiling for 13 choices 2025-12-04T09:14:59.8387568Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.8387642Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.8387706Z unimplemented [] 2025-12-04T09:14:59.8387811Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.8388032Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.8389226Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.8389293Z graph_break [] 2025-12-04T09:14:59.8389419Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.8389491Z Autotune Choices Stats: 2025-12-04T09:14:59.8390873Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_460", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:59.8391156Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.8391375Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.8391690Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.8392826Z triton_flex_attention_460 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8393969Z triton_flex_attention_461 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8395118Z triton_flex_attention_459 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.8396247Z triton_flex_attention_456 0.0114 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8397402Z triton_flex_attention_458 0.0123 ms 75.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.8398521Z triton_flex_attention_457 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8398835Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.4513 seconds precompiling for 6 choices 2025-12-04T09:14:59.8398907Z Autotune Choices Stats: 2025-12-04T09:14:59.8400338Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_463", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.8400775Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.8401098Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.8401662Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.8403040Z triton_flex_attention_backward_463 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8404236Z triton_flex_attention_backward_465 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.8405419Z triton_flex_attention_backward_462 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8406581Z triton_flex_attention_backward_464 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.8407743Z triton_flex_attention_backward_467 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8408919Z triton_flex_attention_backward_468 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.8410083Z triton_flex_attention_backward_469 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.8411254Z triton_flex_attention_backward_466 0.0133 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8412501Z triton_flex_attention_backward_471 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8413656Z triton_flex_attention_backward_470 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8413937Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.2662 seconds precompiling for 13 choices 2025-12-04T09:14:59.8414065Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.8414141Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.8414203Z unimplemented [] 2025-12-04T09:14:59.8414303Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.8414491Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.8415667Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.8415769Z graph_break [] 2025-12-04T09:14:59.8415895Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.8415968Z Autotune Choices Stats: 2025-12-04T09:14:59.8417363Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_479", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.007968000136315823, "best_triton_pos": 0} 2025-12-04T09:14:59.8417612Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.8417828Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.8418138Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.8419325Z triton_flex_attention_479 0.0080 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8420466Z triton_flex_attention_480 0.0082 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8421595Z triton_flex_attention_477 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.8422744Z triton_flex_attention_478 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.8423861Z triton_flex_attention_475 0.0113 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8425011Z triton_flex_attention_476 0.0133 ms 59.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8425250Z SingleProcess AUTOTUNE benchmarking takes 0.2917 seconds and 1.3266 seconds precompiling for 6 choices 2025-12-04T09:14:59.8425325Z Autotune Choices Stats: 2025-12-04T09:14:59.8426751Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_481", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.8427189Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.8427550Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.8428113Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.8429318Z triton_flex_attention_backward_481 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8430509Z triton_flex_attention_backward_482 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8431660Z triton_flex_attention_backward_483 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.8432835Z triton_flex_attention_backward_484 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.8434030Z triton_flex_attention_backward_485 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8435184Z triton_flex_attention_backward_486 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8436378Z triton_flex_attention_backward_487 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.8437524Z triton_flex_attention_backward_488 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.8438713Z triton_flex_attention_backward_490 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8439899Z triton_flex_attention_backward_493 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.8440152Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2475 seconds precompiling for 13 choices 2025-12-04T09:14:59.8440280Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.8440354Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.8440416Z unimplemented [] 2025-12-04T09:14:59.8440516Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.8440706Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.8441915Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.8441981Z graph_break [] 2025-12-04T09:14:59.8442105Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.8442178Z Autotune Choices Stats: 2025-12-04T09:14:59.8443569Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_498", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.8443815Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.8444030Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.8444342Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.8445508Z triton_flex_attention_498 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8446654Z triton_flex_attention_499 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8447797Z triton_flex_attention_494 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8448916Z triton_flex_attention_496 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.8450035Z triton_flex_attention_497 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.8451203Z triton_flex_attention_495 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8451446Z SingleProcess AUTOTUNE benchmarking takes 0.2916 seconds and 1.3152 seconds precompiling for 6 choices 2025-12-04T09:14:59.8451524Z Autotune Choices Stats: 2025-12-04T09:14:59.8452987Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_500", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:59.8453457Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.8453817Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.8454379Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.8455541Z triton_flex_attention_backward_500 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8456736Z triton_flex_attention_backward_501 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8457887Z triton_flex_attention_backward_502 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.8459087Z triton_flex_attention_backward_503 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.8460249Z triton_flex_attention_backward_505 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8461424Z triton_flex_attention_backward_507 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.8462812Z triton_flex_attention_backward_506 0.0143 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.8464036Z triton_flex_attention_backward_504 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8465248Z triton_flex_attention_backward_509 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8466407Z triton_flex_attention_backward_512 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.8466660Z SingleProcess AUTOTUNE benchmarking takes 0.6641 seconds and 2.2733 seconds precompiling for 13 choices 2025-12-04T09:14:59.8466839Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.8466917Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.8466980Z unimplemented [] 2025-12-04T09:14:59.8467082Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.8467273Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.8468453Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 27), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 8), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 5), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.8468521Z graph_break [] 2025-12-04T09:14:59.8468648Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.8468715Z Autotune Choices Stats: 2025-12-04T09:14:59.8470113Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_517", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.8470396Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.8470611Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.8470961Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.8472094Z triton_flex_attention_517 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8473234Z triton_flex_attention_518 0.0084 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8474351Z triton_flex_attention_513 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8475474Z triton_flex_attention_515 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.8476613Z triton_flex_attention_516 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.8477744Z triton_flex_attention_514 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8477991Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3311 seconds precompiling for 6 choices 2025-12-04T09:14:59.8478066Z Autotune Choices Stats: 2025-12-04T09:14:59.8479525Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_519", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:59.8479998Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.8480323Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.8480887Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.8482091Z triton_flex_attention_backward_519 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8483260Z triton_flex_attention_backward_520 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8484414Z triton_flex_attention_backward_521 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.8485601Z triton_flex_attention_backward_522 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.8486756Z triton_flex_attention_backward_524 0.0132 ms 93.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8487940Z triton_flex_attention_backward_523 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8489134Z triton_flex_attention_backward_525 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.8490317Z triton_flex_attention_backward_526 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.8491482Z triton_flex_attention_backward_528 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8492674Z triton_flex_attention_backward_531 0.0153 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.8492958Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.2909 seconds precompiling for 13 choices 2025-12-04T09:14:59.8493087Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.8493162Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.8493225Z unimplemented [] 2025-12-04T09:14:59.8493326Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.8493518Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.8494692Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.8494762Z graph_break [] 2025-12-04T09:14:59.8494886Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.8494953Z Autotune Choices Stats: 2025-12-04T09:14:59.8496370Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_536", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.8496658Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.8496881Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.8497194Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.8498315Z triton_flex_attention_536 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8499469Z triton_flex_attention_537 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8500591Z triton_flex_attention_532 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8501745Z triton_flex_attention_534 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.8502857Z triton_flex_attention_535 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.8503974Z triton_flex_attention_533 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8504211Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.2928 seconds precompiling for 6 choices 2025-12-04T09:14:59.8504315Z Autotune Choices Stats: 2025-12-04T09:14:59.8505735Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_538", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:14:59.8506204Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.8506543Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.8507135Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.8508299Z triton_flex_attention_backward_538 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8509461Z triton_flex_attention_backward_539 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8510647Z triton_flex_attention_backward_540 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.8511807Z triton_flex_attention_backward_541 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.8512985Z triton_flex_attention_backward_542 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8514190Z triton_flex_attention_backward_543 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8515343Z triton_flex_attention_backward_545 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.8516530Z triton_flex_attention_backward_544 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.8517682Z triton_flex_attention_backward_547 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8518865Z triton_flex_attention_backward_550 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.8519122Z SingleProcess AUTOTUNE benchmarking takes 0.6681 seconds and 2.3456 seconds precompiling for 13 choices 2025-12-04T09:14:59.8519251Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.8519337Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.8519399Z unimplemented [] 2025-12-04T09:14:59.8519499Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.8519690Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.8520861Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.8520927Z graph_break [] 2025-12-04T09:14:59.8521054Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.8521157Z Autotune Choices Stats: 2025-12-04T09:14:59.8522570Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_555", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:59.8522848Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.8523073Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.8523387Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.8524552Z triton_flex_attention_555 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8525662Z triton_flex_attention_556 0.0112 ms 91.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8526817Z triton_flex_attention_553 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.8527937Z triton_flex_attention_554 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.8529046Z triton_flex_attention_551 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8530202Z triton_flex_attention_552 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8530479Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.3230 seconds precompiling for 6 choices 2025-12-04T09:14:59.8530551Z Autotune Choices Stats: 2025-12-04T09:14:59.8531975Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_559", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:59.8532462Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.8532874Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.8533436Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.8534598Z triton_flex_attention_backward_559 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.8535805Z triton_flex_attention_backward_560 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.8541128Z triton_flex_attention_backward_558 0.0142 ms 93.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8542374Z triton_flex_attention_backward_557 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8543608Z triton_flex_attention_backward_562 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8544812Z triton_flex_attention_backward_561 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8546002Z triton_flex_attention_backward_563 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.8547144Z triton_flex_attention_backward_564 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.8548290Z triton_flex_attention_backward_569 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.8549471Z triton_flex_attention_backward_566 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8549734Z SingleProcess AUTOTUNE benchmarking takes 0.6659 seconds and 2.2862 seconds precompiling for 13 choices 2025-12-04T09:14:59.8549877Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.8549957Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.8550024Z unimplemented [] 2025-12-04T09:14:59.8550132Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.8550330Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.8551567Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.8551673Z graph_break [] 2025-12-04T09:14:59.8551814Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.8551884Z Autotune Choices Stats: 2025-12-04T09:14:59.8553292Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_575", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:59.8553541Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.8553818Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.8554144Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.8555282Z triton_flex_attention_575 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8556396Z triton_flex_attention_574 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8557551Z triton_flex_attention_572 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.8558662Z triton_flex_attention_573 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.8559821Z triton_flex_attention_570 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8560971Z triton_flex_attention_571 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8561219Z SingleProcess AUTOTUNE benchmarking takes 0.2930 seconds and 1.2848 seconds precompiling for 6 choices 2025-12-04T09:14:59.8561287Z Autotune Choices Stats: 2025-12-04T09:14:59.8562965Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_579", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.014271999709308147, "best_triton_pos": 0} 2025-12-04T09:14:59.8563423Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.8563756Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.8564316Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.8565538Z triton_flex_attention_backward_579 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.8566696Z triton_flex_attention_backward_577 0.0143 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8567854Z triton_flex_attention_backward_578 0.0143 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.8569055Z triton_flex_attention_backward_576 0.0145 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8570240Z triton_flex_attention_backward_580 0.0154 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8571424Z triton_flex_attention_backward_581 0.0154 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8572624Z triton_flex_attention_backward_583 0.0155 ms 92.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.8573771Z triton_flex_attention_backward_582 0.0164 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.8574962Z triton_flex_attention_backward_585 0.0164 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8576110Z triton_flex_attention_backward_588 0.0164 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.8576365Z SingleProcess AUTOTUNE benchmarking takes 0.6683 seconds and 2.2637 seconds precompiling for 13 choices 2025-12-04T09:14:59.8576499Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.8576575Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.8576638Z unimplemented [] 2025-12-04T09:14:59.8576741Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.8576964Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.8578172Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.8578239Z graph_break [] 2025-12-04T09:14:59.8578368Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.8578433Z Autotune Choices Stats: 2025-12-04T09:14:59.8579863Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.8580109Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.8580335Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.8580651Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.8581789Z triton_flex_attention_593 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8582953Z triton_flex_attention_594 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8584076Z triton_flex_attention_589 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8585191Z triton_flex_attention_591 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.8586336Z triton_flex_attention_592 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.8587504Z triton_flex_attention_590 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8587754Z SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.3283 seconds precompiling for 6 choices 2025-12-04T09:14:59.8587827Z Autotune Choices Stats: 2025-12-04T09:14:59.8589304Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_595", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:14:59.8589741Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.8590073Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.8590662Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.8591828Z triton_flex_attention_backward_595 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8592986Z triton_flex_attention_backward_596 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8594168Z triton_flex_attention_backward_597 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.8595352Z triton_flex_attention_backward_598 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.8596506Z triton_flex_attention_backward_600 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8597698Z triton_flex_attention_backward_604 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8600143Z triton_flex_attention_backward_599 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8602562Z triton_flex_attention_backward_601 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.8605159Z triton_flex_attention_backward_602 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.8607718Z triton_flex_attention_backward_607 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.8609264Z SingleProcess AUTOTUNE benchmarking takes 0.6683 seconds and 2.3815 seconds precompiling for 13 choices 2025-12-04T09:14:59.8609749Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.8610069Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.8610268Z unimplemented [] 2025-12-04T09:14:59.8610483Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.8610862Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.8612384Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.8613754Z graph_break [] 2025-12-04T09:14:59.8614001Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.8614285Z Autotune Choices Stats: 2025-12-04T09:14:59.8615840Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_612", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.8617541Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.8618077Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.8618695Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.8620283Z triton_flex_attention_612 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8622581Z triton_flex_attention_613 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8624889Z triton_flex_attention_610 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.8627219Z triton_flex_attention_608 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8629573Z triton_flex_attention_611 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.8631927Z triton_flex_attention_609 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8633369Z SingleProcess AUTOTUNE benchmarking takes 0.2948 seconds and 1.3336 seconds precompiling for 6 choices 2025-12-04T09:14:59.8633755Z Autotune Choices Stats: 2025-12-04T09:14:59.8635304Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_614", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:14:59.8637257Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.8638089Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.8639047Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.8640866Z triton_flex_attention_backward_614 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8643311Z triton_flex_attention_backward_615 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8645683Z triton_flex_attention_backward_616 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.8648095Z triton_flex_attention_backward_617 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.8650515Z triton_flex_attention_backward_619 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8652986Z triton_flex_attention_backward_620 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.8655426Z triton_flex_attention_backward_621 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.8657790Z triton_flex_attention_backward_618 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8660169Z triton_flex_attention_backward_623 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8662737Z triton_flex_attention_backward_626 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.8664271Z SingleProcess AUTOTUNE benchmarking takes 0.6671 seconds and 2.3094 seconds precompiling for 13 choices 2025-12-04T09:14:59.8664725Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.8665029Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.8665226Z unimplemented [] 2025-12-04T09:14:59.8665426Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.8665791Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.8667279Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.8668593Z graph_break [] 2025-12-04T09:14:59.8668829Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.8669100Z Autotune Choices Stats: 2025-12-04T09:14:59.8670603Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_631", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:59.8672352Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.8672892Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.8673504Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.8675029Z triton_flex_attention_631 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8677329Z triton_flex_attention_632 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8679673Z triton_flex_attention_627 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8682009Z triton_flex_attention_629 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.8684302Z triton_flex_attention_630 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.8686634Z triton_flex_attention_628 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8688068Z SingleProcess AUTOTUNE benchmarking takes 0.2939 seconds and 1.3035 seconds precompiling for 6 choices 2025-12-04T09:14:59.8688446Z Autotune Choices Stats: 2025-12-04T09:14:59.8689998Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_634", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:59.8691956Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.8692831Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.8693791Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.8695591Z triton_flex_attention_backward_634 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8698006Z triton_flex_attention_backward_635 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.8700450Z triton_flex_attention_backward_636 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.8702855Z triton_flex_attention_backward_633 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8705227Z triton_flex_attention_backward_638 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8707591Z triton_flex_attention_backward_640 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.8710002Z triton_flex_attention_backward_637 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8712368Z triton_flex_attention_backward_639 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.8714759Z triton_flex_attention_backward_642 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8717156Z triton_flex_attention_backward_645 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.8718622Z SingleProcess AUTOTUNE benchmarking takes 0.6697 seconds and 2.2461 seconds precompiling for 13 choices 2025-12-04T09:14:59.8719075Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.8719352Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.8719548Z unimplemented [] 2025-12-04T09:14:59.8719759Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.8720132Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.8721615Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.8722920Z graph_break [] 2025-12-04T09:14:59.8723148Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.8723421Z Autotune Choices Stats: 2025-12-04T09:14:59.8724931Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_650", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:59.8726666Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.8727204Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.8727820Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.8729344Z triton_flex_attention_650 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8731679Z triton_flex_attention_651 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8734072Z triton_flex_attention_648 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.8736364Z triton_flex_attention_646 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8738690Z triton_flex_attention_649 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.8740990Z triton_flex_attention_647 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8742417Z SingleProcess AUTOTUNE benchmarking takes 0.2938 seconds and 1.3188 seconds precompiling for 6 choices 2025-12-04T09:14:59.8742836Z Autotune Choices Stats: 2025-12-04T09:14:59.8744372Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_654", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:59.8746290Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.8747124Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.8748092Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.8749933Z triton_flex_attention_backward_654 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.8752386Z triton_flex_attention_backward_652 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8754790Z triton_flex_attention_backward_653 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8757158Z triton_flex_attention_backward_655 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.8759522Z triton_flex_attention_backward_657 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8762102Z triton_flex_attention_backward_659 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.8764481Z triton_flex_attention_backward_658 0.0156 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.8766849Z triton_flex_attention_backward_656 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8769309Z triton_flex_attention_backward_661 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8771741Z triton_flex_attention_backward_664 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.8773253Z SingleProcess AUTOTUNE benchmarking takes 0.6665 seconds and 2.3325 seconds precompiling for 13 choices 2025-12-04T09:14:59.8773711Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.8774046Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.8774239Z unimplemented [] 2025-12-04T09:14:59.8774440Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.8774798Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.8776234Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.8777540Z graph_break [] 2025-12-04T09:14:59.8777812Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.8778091Z Autotune Choices Stats: 2025-12-04T09:14:59.8779586Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_669", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:59.8781264Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.8781804Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.8782415Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.8783942Z triton_flex_attention_669 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8786281Z triton_flex_attention_670 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8788604Z triton_flex_attention_667 0.0123 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.8790941Z triton_flex_attention_665 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8793233Z triton_flex_attention_668 0.0135 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.8795526Z triton_flex_attention_666 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8796996Z SingleProcess AUTOTUNE benchmarking takes 0.2949 seconds and 1.4095 seconds precompiling for 6 choices 2025-12-04T09:14:59.8797382Z Autotune Choices Stats: 2025-12-04T09:14:59.8798937Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_673", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012512000277638435, "best_triton_pos": 0} 2025-12-04T09:14:59.8800866Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.8801699Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.8802693Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.8804515Z triton_flex_attention_backward_673 0.0125 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.8806891Z triton_flex_attention_backward_672 0.0133 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8809295Z triton_flex_attention_backward_674 0.0133 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.8811662Z triton_flex_attention_backward_671 0.0143 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8814119Z triton_flex_attention_backward_676 0.0143 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8816482Z triton_flex_attention_backward_675 0.0152 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8818848Z triton_flex_attention_backward_677 0.0154 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.8821245Z triton_flex_attention_backward_678 0.0154 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.8823643Z triton_flex_attention_backward_680 0.0164 ms 76.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8826038Z triton_flex_attention_backward_683 0.0164 ms 76.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.8827504Z SingleProcess AUTOTUNE benchmarking takes 0.6665 seconds and 2.3098 seconds precompiling for 13 choices 2025-12-04T09:14:59.8827948Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.8828225Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.8828416Z unimplemented [] 2025-12-04T09:14:59.8828616Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.8828969Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.8830404Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.8831759Z graph_break [] 2025-12-04T09:14:59.8831978Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.8832262Z Autotune Choices Stats: 2025-12-04T09:14:59.8833770Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_688", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:59.8835452Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.8835983Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.8836590Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.8838144Z triton_flex_attention_688 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8840482Z triton_flex_attention_689 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8842820Z triton_flex_attention_686 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.8845118Z triton_flex_attention_684 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8847410Z triton_flex_attention_687 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.8849733Z triton_flex_attention_685 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8851156Z SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.3382 seconds precompiling for 6 choices 2025-12-04T09:14:59.8851533Z Autotune Choices Stats: 2025-12-04T09:14:59.8853123Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_693", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:59.8855040Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.8855909Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.8856898Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.8858697Z triton_flex_attention_backward_693 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.8861106Z triton_flex_attention_backward_692 0.0135 ms 98.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.8863727Z triton_flex_attention_backward_691 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8866101Z triton_flex_attention_backward_695 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8868534Z triton_flex_attention_backward_690 0.0153 ms 86.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8870928Z triton_flex_attention_backward_696 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.8873338Z triton_flex_attention_backward_697 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.8875744Z triton_flex_attention_backward_699 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8878110Z triton_flex_attention_backward_694 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8880517Z triton_flex_attention_backward_702 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.8881988Z SingleProcess AUTOTUNE benchmarking takes 0.6670 seconds and 2.3646 seconds precompiling for 13 choices 2025-12-04T09:14:59.8882443Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.8882733Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.8882928Z unimplemented [] 2025-12-04T09:14:59.8883168Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.8883534Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.8884966Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.8886258Z graph_break [] 2025-12-04T09:14:59.8886481Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.8886755Z Autotune Choices Stats: 2025-12-04T09:14:59.8888251Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_707", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:59.8889929Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.8890457Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.8891122Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.8892741Z triton_flex_attention_707 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8895034Z triton_flex_attention_708 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8897381Z triton_flex_attention_703 0.0134 ms 76.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8899670Z triton_flex_attention_705 0.0135 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.8901995Z triton_flex_attention_706 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.8904294Z triton_flex_attention_704 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8905724Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3640 seconds precompiling for 6 choices 2025-12-04T09:14:59.8906108Z Autotune Choices Stats: 2025-12-04T09:14:59.8907672Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_712", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:59.8909650Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.8910474Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.8911434Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.8913271Z triton_flex_attention_backward_712 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.8915646Z triton_flex_attention_backward_710 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8918019Z triton_flex_attention_backward_711 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.8920413Z triton_flex_attention_backward_709 0.0153 ms 86.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8922771Z triton_flex_attention_backward_714 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8925166Z triton_flex_attention_backward_716 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.8927561Z triton_flex_attention_backward_713 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8929941Z triton_flex_attention_backward_715 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.8932381Z triton_flex_attention_backward_718 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8934780Z triton_flex_attention_backward_721 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.8936294Z SingleProcess AUTOTUNE benchmarking takes 0.6684 seconds and 2.3297 seconds precompiling for 13 choices 2025-12-04T09:14:59.8936749Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.8937029Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.8937219Z unimplemented [] 2025-12-04T09:14:59.8937418Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.8937779Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.8939217Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.8940525Z graph_break [] 2025-12-04T09:14:59.8940746Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.8941026Z Autotune Choices Stats: 2025-12-04T09:14:59.8942581Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_726", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.8946610Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.8947158Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.8947768Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.8949308Z triton_flex_attention_726 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8951661Z triton_flex_attention_727 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8953968Z triton_flex_attention_724 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.8956239Z triton_flex_attention_722 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8958511Z triton_flex_attention_725 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.8960783Z triton_flex_attention_723 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8962402Z SingleProcess AUTOTUNE benchmarking takes 0.2946 seconds and 1.3261 seconds precompiling for 6 choices 2025-12-04T09:14:59.8962796Z Autotune Choices Stats: 2025-12-04T09:14:59.8964445Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_730", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:59.8966550Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.8967381Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.8968338Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.8970206Z triton_flex_attention_backward_730 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.8972688Z triton_flex_attention_backward_729 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8975068Z triton_flex_attention_backward_731 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.8977438Z triton_flex_attention_backward_728 0.0152 ms 87.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8979809Z triton_flex_attention_backward_733 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8982229Z triton_flex_attention_backward_732 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8984670Z triton_flex_attention_backward_734 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.8987068Z triton_flex_attention_backward_735 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.8989434Z triton_flex_attention_backward_737 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.8991797Z triton_flex_attention_backward_736 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.8993268Z SingleProcess AUTOTUNE benchmarking takes 0.6686 seconds and 2.3200 seconds precompiling for 13 choices 2025-12-04T09:14:59.8993738Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.8994024Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.8994215Z unimplemented [] 2025-12-04T09:14:59.8994426Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.8994790Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.8996236Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.8997522Z graph_break [] 2025-12-04T09:14:59.8997751Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.8998038Z Autotune Choices Stats: 2025-12-04T09:14:59.8999595Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_745", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.9001359Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9001891Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9002510Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9004092Z triton_flex_attention_745 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9006395Z triton_flex_attention_746 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9008694Z triton_flex_attention_743 0.0142 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.9010988Z triton_flex_attention_741 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9013347Z triton_flex_attention_744 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.9015638Z triton_flex_attention_742 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9017116Z SingleProcess AUTOTUNE benchmarking takes 0.2945 seconds and 1.2852 seconds precompiling for 6 choices 2025-12-04T09:14:59.9017580Z Autotune Choices Stats: 2025-12-04T09:14:59.9019123Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_748", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:59.9021045Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9021905Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9022863Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9024665Z triton_flex_attention_backward_748 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9027043Z triton_flex_attention_backward_749 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9029408Z triton_flex_attention_backward_747 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9031778Z triton_flex_attention_backward_750 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9034233Z triton_flex_attention_backward_752 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9036681Z triton_flex_attention_backward_751 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9039089Z triton_flex_attention_backward_753 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9041448Z triton_flex_attention_backward_754 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9043805Z triton_flex_attention_backward_756 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9046169Z triton_flex_attention_backward_759 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.9047628Z SingleProcess AUTOTUNE benchmarking takes 0.6747 seconds and 2.3463 seconds precompiling for 13 choices 2025-12-04T09:14:59.9048080Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.9048357Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.9048551Z unimplemented [] 2025-12-04T09:14:59.9048754Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.9049116Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.9050570Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.9051911Z graph_break [] 2025-12-04T09:14:59.9052165Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.9052510Z Autotune Choices Stats: 2025-12-04T09:14:59.9054007Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_764", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:59.9055692Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9056231Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9056891Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9058408Z triton_flex_attention_764 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9060706Z triton_flex_attention_765 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9063167Z triton_flex_attention_762 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.9065461Z triton_flex_attention_760 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9067746Z triton_flex_attention_763 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.9070103Z triton_flex_attention_761 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9071641Z SingleProcess AUTOTUNE benchmarking takes 0.2940 seconds and 1.3826 seconds precompiling for 6 choices 2025-12-04T09:14:59.9072020Z Autotune Choices Stats: 2025-12-04T09:14:59.9073551Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_769", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013376000337302685, "best_triton_pos": 0} 2025-12-04T09:14:59.9075516Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9076360Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9077325Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9079121Z triton_flex_attention_backward_769 0.0134 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9081504Z triton_flex_attention_backward_767 0.0143 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9083871Z triton_flex_attention_backward_768 0.0143 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9086279Z triton_flex_attention_backward_766 0.0154 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9088715Z triton_flex_attention_backward_770 0.0154 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9091081Z triton_flex_attention_backward_771 0.0154 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9093552Z triton_flex_attention_backward_772 0.0154 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9095917Z triton_flex_attention_backward_773 0.0154 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9098304Z triton_flex_attention_backward_775 0.0164 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9100693Z triton_flex_attention_backward_778 0.0164 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.9102160Z SingleProcess AUTOTUNE benchmarking takes 0.6673 seconds and 2.4135 seconds precompiling for 13 choices 2025-12-04T09:14:59.9102612Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.9102889Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.9103073Z unimplemented [] 2025-12-04T09:14:59.9103275Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.9103633Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.9105110Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.9106467Z graph_break [] 2025-12-04T09:14:59.9106701Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.9106972Z Autotune Choices Stats: 2025-12-04T09:14:59.9108476Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_783", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.9110205Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9110736Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9111342Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9112865Z triton_flex_attention_783 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9115166Z triton_flex_attention_784 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9117470Z triton_flex_attention_781 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.9119777Z triton_flex_attention_779 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9122374Z triton_flex_attention_782 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.9124743Z triton_flex_attention_780 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9126178Z SingleProcess AUTOTUNE benchmarking takes 0.2940 seconds and 1.3371 seconds precompiling for 6 choices 2025-12-04T09:14:59.9126563Z Autotune Choices Stats: 2025-12-04T09:14:59.9128137Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_787", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:59.9130057Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9130885Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9131843Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9133685Z triton_flex_attention_backward_787 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9136057Z triton_flex_attention_backward_785 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9138417Z triton_flex_attention_backward_786 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9140828Z triton_flex_attention_backward_788 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9143259Z triton_flex_attention_backward_790 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9145649Z triton_flex_attention_backward_791 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9148014Z triton_flex_attention_backward_789 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9150387Z triton_flex_attention_backward_792 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9152757Z triton_flex_attention_backward_794 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9155121Z triton_flex_attention_backward_797 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.9156592Z SingleProcess AUTOTUNE benchmarking takes 0.6666 seconds and 2.3976 seconds precompiling for 13 choices 2025-12-04T09:14:59.9157097Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:14:59.9157434Z Traceback (most recent call last): 2025-12-04T09:14:59.9157926Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:14:59.9158446Z self.assertTrue( 2025-12-04T09:14:59.9158754Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:14:59.9159102Z raise self.failureException(msg) 2025-12-04T09:14:59.9159493Z AssertionError: False is not true : Log file /tmp/tmpdl9kigxh/flex_attention_configs.json was not created 2025-12-04T09:14:59.9159817Z 2025-12-04T09:14:59.9159951Z To execute this test, run the following from the base repo dir: 2025-12-04T09:14:59.9160411Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:14:59.9160739Z 2025-12-04T09:14:59.9160906Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:14:59.9161270Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.9161552Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.9161746Z unimplemented [] 2025-12-04T09:14:59.9162169Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.9163592Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('async_compile_cache_miss', 23), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:14:59.9165034Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.9165358Z graph_break [] 2025-12-04T09:14:59.9165593Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.9166778Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:14:59.9167907Z current_size = base.storage().size() 2025-12-04T09:14:59.9168123Z Autotune Choices Stats: 2025-12-04T09:14:59.9169637Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:14:59.9171322Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9171858Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9172504Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9174093Z triton_flex_attention_4 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9176494Z triton_flex_attention_5 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9178772Z triton_flex_attention_0 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9181093Z triton_flex_attention_2 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.9183380Z triton_flex_attention_3 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.9185676Z triton_flex_attention_1 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9187107Z SingleProcess AUTOTUNE benchmarking takes 0.2600 seconds and 1.4244 seconds precompiling for 6 choices 2025-12-04T09:14:59.9187498Z Autotune Choices Stats: 2025-12-04T09:14:59.9189046Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:14:59.9190957Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9191787Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9192777Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9194630Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9197031Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9199417Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9201781Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9204152Z triton_flex_attention_backward_11 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9206509Z triton_flex_attention_backward_12 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9208865Z triton_flex_attention_backward_13 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9211267Z triton_flex_attention_backward_10 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9213737Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9216125Z triton_flex_attention_backward_15 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9217598Z SingleProcess AUTOTUNE benchmarking takes 0.8717 seconds and 2.2315 seconds precompiling for 13 choices 2025-12-04T09:14:59.9218048Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.9218327Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.9218519Z unimplemented [] 2025-12-04T09:14:59.9218719Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.9219075Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.9220528Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.9221815Z graph_break [] 2025-12-04T09:14:59.9222035Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.9222302Z Autotune Choices Stats: 2025-12-04T09:14:59.9223802Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.9225488Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9226036Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9226642Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9228209Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9230598Z triton_flex_attention_24 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9232913Z triton_flex_attention_19 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9235201Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.9237489Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.9239788Z triton_flex_attention_20 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9241207Z SingleProcess AUTOTUNE benchmarking takes 0.2898 seconds and 1.2778 seconds precompiling for 6 choices 2025-12-04T09:14:59.9241586Z Autotune Choices Stats: 2025-12-04T09:14:59.9243112Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.9245056Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9245916Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9246937Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9248728Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9251136Z triton_flex_attention_backward_27 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9253600Z triton_flex_attention_backward_28 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9255953Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9258308Z triton_flex_attention_backward_30 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9260674Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9263286Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9265767Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9268204Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9270580Z triton_flex_attention_backward_37 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.9272050Z SingleProcess AUTOTUNE benchmarking takes 0.6605 seconds and 2.3199 seconds precompiling for 13 choices 2025-12-04T09:14:59.9272512Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.9272814Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.9273007Z unimplemented [] 2025-12-04T09:14:59.9273216Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.9273584Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.9275031Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.9276333Z graph_break [] 2025-12-04T09:14:59.9276558Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.9276837Z Autotune Choices Stats: 2025-12-04T09:14:59.9278361Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.9280065Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9280637Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9281311Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9282836Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9285173Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9287459Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.9289773Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.9292065Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9294451Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9295893Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2616 seconds precompiling for 6 choices 2025-12-04T09:14:59.9296279Z Autotune Choices Stats: 2025-12-04T09:14:59.9297881Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.9299890Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9300713Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9301679Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9303555Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9305933Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9308308Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9310688Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9313049Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9315448Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9317884Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9320254Z triton_flex_attention_backward_50 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9322647Z triton_flex_attention_backward_53 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9325005Z triton_flex_attention_backward_56 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.9326479Z SingleProcess AUTOTUNE benchmarking takes 0.6631 seconds and 2.2635 seconds precompiling for 13 choices 2025-12-04T09:14:59.9326935Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.9327218Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.9327411Z unimplemented [] 2025-12-04T09:14:59.9327610Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.9327974Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.9329413Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.9330726Z graph_break [] 2025-12-04T09:14:59.9330947Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.9331222Z Autotune Choices Stats: 2025-12-04T09:14:59.9332801Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.9334571Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9335109Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9335715Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9337234Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9339564Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9341859Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.9342981Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.9344084Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9345195Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9345442Z SingleProcess AUTOTUNE benchmarking takes 0.2907 seconds and 1.2500 seconds precompiling for 6 choices 2025-12-04T09:14:59.9345507Z Autotune Choices Stats: 2025-12-04T09:14:59.9346964Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.9347475Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9347800Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9348415Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9349574Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9350737Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9351895Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9353043Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9354192Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9355386Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9356603Z triton_flex_attention_backward_69 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9357781Z triton_flex_attention_backward_70 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9358929Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9360088Z triton_flex_attention_backward_75 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.9360332Z SingleProcess AUTOTUNE benchmarking takes 0.6601 seconds and 2.1469 seconds precompiling for 13 choices 2025-12-04T09:14:59.9360465Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.9360535Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.9360596Z unimplemented [] 2025-12-04T09:14:59.9360704Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.9360889Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.9362256Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.9362321Z graph_break [] 2025-12-04T09:14:59.9362452Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.9362524Z Autotune Choices Stats: 2025-12-04T09:14:59.9363970Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.9364310Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9364533Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9364855Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9366031Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9367152Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9368264Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.9369395Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.9370501Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9371650Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9371963Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2576 seconds precompiling for 6 choices 2025-12-04T09:14:59.9372039Z Autotune Choices Stats: 2025-12-04T09:14:59.9373504Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_84", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.9373942Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9374305Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9374871Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9376030Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9377193Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9378345Z triton_flex_attention_backward_83 0.0122 ms 92.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9379488Z triton_flex_attention_backward_82 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9380678Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9381896Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9383075Z triton_flex_attention_backward_88 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9384224Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9385369Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9386524Z triton_flex_attention_backward_91 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9386769Z SingleProcess AUTOTUNE benchmarking takes 0.6599 seconds and 2.2799 seconds precompiling for 13 choices 2025-12-04T09:14:59.9386905Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.9386977Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.9387042Z unimplemented [] 2025-12-04T09:14:59.9387149Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.9387332Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.9388545Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.9388639Z graph_break [] 2025-12-04T09:14:59.9388798Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.9388871Z Autotune Choices Stats: 2025-12-04T09:14:59.9390260Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:59.9390512Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9390730Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9391098Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9392222Z triton_flex_attention_99 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9393351Z triton_flex_attention_100 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9394461Z triton_flex_attention_95 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9395575Z triton_flex_attention_98 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.9396683Z triton_flex_attention_97 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.9397838Z triton_flex_attention_96 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9398138Z SingleProcess AUTOTUNE benchmarking takes 0.2905 seconds and 1.2994 seconds precompiling for 6 choices 2025-12-04T09:14:59.9398213Z Autotune Choices Stats: 2025-12-04T09:14:59.9399682Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_102", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.9400120Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9400444Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9401010Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9402177Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9403339Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9404501Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9405681Z triton_flex_attention_backward_101 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9406898Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9408057Z triton_flex_attention_backward_107 0.0133 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9409248Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9410407Z triton_flex_attention_backward_108 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9411561Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9412763Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9413011Z SingleProcess AUTOTUNE benchmarking takes 0.6615 seconds and 2.2434 seconds precompiling for 13 choices 2025-12-04T09:14:59.9413145Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.9413214Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.9413275Z unimplemented [] 2025-12-04T09:14:59.9413381Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.9413563Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.9414774Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.9414916Z graph_break [] 2025-12-04T09:14:59.9415041Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.9415113Z Autotune Choices Stats: 2025-12-04T09:14:59.9416495Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.9416776Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9416997Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9417311Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9418442Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9419574Z triton_flex_attention_119 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9420687Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.9421809Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.9422948Z triton_flex_attention_114 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9424124Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9424362Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2587 seconds precompiling for 6 choices 2025-12-04T09:14:59.9424434Z Autotune Choices Stats: 2025-12-04T09:14:59.9425893Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:59.9426332Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9426657Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9427225Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9428382Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9429547Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9430749Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9431934Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9433131Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9434313Z triton_flex_attention_backward_124 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9435464Z triton_flex_attention_backward_126 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9436616Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9437775Z triton_flex_attention_backward_129 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9438925Z triton_flex_attention_backward_132 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.9439174Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.3330 seconds precompiling for 13 choices 2025-12-04T09:14:59.9439309Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.9439410Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.9439503Z unimplemented [] 2025-12-04T09:14:59.9439646Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.9439828Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.9440993Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.9441063Z graph_break [] 2025-12-04T09:14:59.9441189Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.9441262Z Autotune Choices Stats: 2025-12-04T09:14:59.9442670Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:59.9442919Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9443137Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9443461Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9444598Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9445711Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9446825Z triton_flex_attention_136 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.9447988Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.9449178Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9450310Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9450585Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3488 seconds precompiling for 6 choices 2025-12-04T09:14:59.9450660Z Autotune Choices Stats: 2025-12-04T09:14:59.9452085Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_140", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.9452566Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9452895Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9453462Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9454625Z triton_flex_attention_backward_140 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9455781Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9456969Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9458237Z triton_flex_attention_backward_142 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9459428Z triton_flex_attention_backward_144 0.0131 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9460574Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9461731Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9463053Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9464215Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9465422Z triton_flex_attention_backward_149 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9465759Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.2392 seconds precompiling for 13 choices 2025-12-04T09:14:59.9465896Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.9465976Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.9466044Z unimplemented [] 2025-12-04T09:14:59.9466153Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.9466337Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.9467512Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.9467580Z graph_break [] 2025-12-04T09:14:59.9467758Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.9467831Z Autotune Choices Stats: 2025-12-04T09:14:59.9469214Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.9469466Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9469685Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9469998Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9471122Z triton_flex_attention_156 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9472232Z triton_flex_attention_157 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9473374Z triton_flex_attention_152 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9474517Z triton_flex_attention_154 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.9475665Z triton_flex_attention_155 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.9476832Z triton_flex_attention_153 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9477074Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3309 seconds precompiling for 6 choices 2025-12-04T09:14:59.9477144Z Autotune Choices Stats: 2025-12-04T09:14:59.9478571Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_160", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.9479010Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9479341Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9479903Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9481065Z triton_flex_attention_backward_160 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9482248Z triton_flex_attention_backward_158 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9483454Z triton_flex_attention_backward_159 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9484644Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9485795Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9486938Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9488093Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9489246Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9490391Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9491575Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.9491883Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.1546 seconds precompiling for 13 choices 2025-12-04T09:14:59.9492021Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.9492097Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.9492158Z unimplemented [] 2025-12-04T09:14:59.9492263Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.9492488Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.9493703Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.9493770Z graph_break [] 2025-12-04T09:14:59.9493895Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.9493966Z Autotune Choices Stats: 2025-12-04T09:14:59.9495346Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.9495594Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9495809Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9496117Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9497242Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9498353Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9499510Z triton_flex_attention_171 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9500695Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.9501836Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.9502954Z triton_flex_attention_172 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9503197Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3199 seconds precompiling for 6 choices 2025-12-04T09:14:59.9503272Z Autotune Choices Stats: 2025-12-04T09:14:59.9504694Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_178", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.9505139Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9505466Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9506029Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9507227Z triton_flex_attention_backward_178 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9508460Z triton_flex_attention_backward_177 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9509605Z triton_flex_attention_backward_180 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9510791Z triton_flex_attention_backward_179 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9511945Z triton_flex_attention_backward_182 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9513090Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9514243Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9515391Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9521587Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9522888Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.9523168Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3449 seconds precompiling for 13 choices 2025-12-04T09:14:59.9523322Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.9523408Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.9523476Z unimplemented [] 2025-12-04T09:14:59.9523622Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.9523824Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.9525011Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.9525085Z graph_break [] 2025-12-04T09:14:59.9525224Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.9525305Z Autotune Choices Stats: 2025-12-04T09:14:59.9526721Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:59.9526979Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9527203Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9527526Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9528683Z triton_flex_attention_194 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9529846Z triton_flex_attention_195 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9531035Z triton_flex_attention_192 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.9532185Z triton_flex_attention_193 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.9533407Z triton_flex_attention_190 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9534530Z triton_flex_attention_191 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9534780Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3094 seconds precompiling for 6 choices 2025-12-04T09:14:59.9534855Z Autotune Choices Stats: 2025-12-04T09:14:59.9536289Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.9536731Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9537056Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9537620Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9538824Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9540057Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9541240Z triton_flex_attention_backward_197 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9542394Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9543539Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9544690Z triton_flex_attention_backward_200 0.0124 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9545848Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9547027Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9548260Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.9549416Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9549702Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.3930 seconds precompiling for 13 choices 2025-12-04T09:14:59.9549838Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.9549913Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.9549978Z unimplemented [] 2025-12-04T09:14:59.9550083Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.9550273Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.9551452Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.9551531Z graph_break [] 2025-12-04T09:14:59.9551662Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.9551731Z Autotune Choices Stats: 2025-12-04T09:14:59.9553126Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008063999935984612, "best_triton_pos": 0} 2025-12-04T09:14:59.9553373Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9553599Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9553914Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9555077Z triton_flex_attention_213 0.0081 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9556248Z triton_flex_attention_214 0.0082 ms 98.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9557367Z triton_flex_attention_211 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.9558516Z triton_flex_attention_212 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.9559634Z triton_flex_attention_209 0.0112 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9560756Z triton_flex_attention_210 0.0133 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9560998Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.3148 seconds precompiling for 6 choices 2025-12-04T09:14:59.9561071Z Autotune Choices Stats: 2025-12-04T09:14:59.9562703Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.9563162Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9563567Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9564245Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9565435Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9566646Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9567813Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9568976Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9570131Z triton_flex_attention_backward_219 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9571285Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9572519Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9573709Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9574904Z triton_flex_attention_backward_224 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9576081Z triton_flex_attention_backward_227 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.9576334Z SingleProcess AUTOTUNE benchmarking takes 0.6619 seconds and 2.3746 seconds precompiling for 13 choices 2025-12-04T09:14:59.9576464Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.9576541Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.9576604Z unimplemented [] 2025-12-04T09:14:59.9576711Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.9576904Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.9578085Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.9578157Z graph_break [] 2025-12-04T09:14:59.9578285Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.9578353Z Autotune Choices Stats: 2025-12-04T09:14:59.9579767Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.9580012Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9580234Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9580606Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9581773Z triton_flex_attention_232 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9582926Z triton_flex_attention_233 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9584085Z triton_flex_attention_228 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9585206Z triton_flex_attention_230 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.9586330Z triton_flex_attention_231 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.9587451Z triton_flex_attention_229 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9587694Z SingleProcess AUTOTUNE benchmarking takes 0.2900 seconds and 1.2906 seconds precompiling for 6 choices 2025-12-04T09:14:59.9587770Z Autotune Choices Stats: 2025-12-04T09:14:59.9589199Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:59.9589674Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9590063Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9590631Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9591810Z triton_flex_attention_backward_234 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9593014Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9594170Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9595328Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9596494Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9597649Z triton_flex_attention_backward_240 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9598836Z triton_flex_attention_backward_241 0.0134 ms 91.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9600089Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9601278Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9602430Z triton_flex_attention_backward_246 0.0153 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.9602682Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2768 seconds precompiling for 13 choices 2025-12-04T09:14:59.9602814Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.9602891Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.9602954Z unimplemented [] 2025-12-04T09:14:59.9603057Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.9603246Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.9604417Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.9604485Z graph_break [] 2025-12-04T09:14:59.9604613Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.9604683Z Autotune Choices Stats: 2025-12-04T09:14:59.9606072Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.9606347Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9606636Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9606949Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9608077Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9609243Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9610361Z triton_flex_attention_250 0.0111 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.9611488Z triton_flex_attention_247 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9612654Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.9613777Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9614020Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2952 seconds precompiling for 6 choices 2025-12-04T09:14:59.9614091Z Autotune Choices Stats: 2025-12-04T09:14:59.9615558Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_255", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.9616056Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9616379Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9616939Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9618149Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9619303Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9620452Z triton_flex_attention_backward_254 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9621607Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9622756Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9623944Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9625158Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9626340Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9627501Z triton_flex_attention_backward_262 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9628652Z triton_flex_attention_backward_265 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.9628907Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.3139 seconds precompiling for 13 choices 2025-12-04T09:14:59.9629036Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.9629119Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.9629191Z unimplemented [] 2025-12-04T09:14:59.9629299Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.9629490Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.9630668Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.9630738Z graph_break [] 2025-12-04T09:14:59.9630866Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.9630933Z Autotune Choices Stats: 2025-12-04T09:14:59.9632355Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.9632663Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9632890Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9633206Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9634369Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9635484Z triton_flex_attention_271 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9636597Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9637712Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.9638829Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.9639956Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9640194Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T09:14:59.9640294Z Autotune Choices Stats: 2025-12-04T09:14:59.9641753Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:59.9642243Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9642570Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9643162Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9644337Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9645501Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9646652Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9647813Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9648994Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9650213Z triton_flex_attention_backward_278 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9651375Z triton_flex_attention_backward_279 0.0136 ms 90.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9652605Z triton_flex_attention_backward_276 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9653761Z triton_flex_attention_backward_281 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9654911Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.9655164Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.2940 seconds precompiling for 13 choices 2025-12-04T09:14:59.9655295Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.9655376Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.9655441Z unimplemented [] 2025-12-04T09:14:59.9655545Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.9655739Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.9656926Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.9656999Z graph_break [] 2025-12-04T09:14:59.9657167Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.9657238Z Autotune Choices Stats: 2025-12-04T09:14:59.9658664Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.9658939Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9659171Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9659481Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9660644Z triton_flex_attention_289 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9661760Z triton_flex_attention_290 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9663121Z triton_flex_attention_287 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.9664239Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9665363Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.9666544Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9666879Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3003 seconds precompiling for 6 choices 2025-12-04T09:14:59.9666949Z Autotune Choices Stats: 2025-12-04T09:14:59.9668400Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_294", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.012160000391304493, "best_triton_pos": 0} 2025-12-04T09:14:59.9668892Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9669223Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9669782Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9670955Z triton_flex_attention_backward_294 0.0122 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9672113Z triton_flex_attention_backward_291 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9673259Z triton_flex_attention_backward_292 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9674412Z triton_flex_attention_backward_293 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9675590Z triton_flex_attention_backward_296 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9676826Z triton_flex_attention_backward_295 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9678011Z triton_flex_attention_backward_297 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9679166Z triton_flex_attention_backward_298 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9680328Z triton_flex_attention_backward_300 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9681477Z triton_flex_attention_backward_303 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.9681738Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.2523 seconds precompiling for 13 choices 2025-12-04T09:14:59.9681870Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.9681944Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.9682007Z unimplemented [] 2025-12-04T09:14:59.9682110Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.9682301Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.9683720Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.9683867Z graph_break [] 2025-12-04T09:14:59.9683999Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.9684069Z Autotune Choices Stats: 2025-12-04T09:14:59.9685464Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_308", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.9685727Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9685984Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9686299Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9687431Z triton_flex_attention_308 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9688539Z triton_flex_attention_309 0.0082 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9689658Z triton_flex_attention_307 0.0112 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.9690777Z triton_flex_attention_304 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9691927Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.9693149Z triton_flex_attention_305 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9693394Z SingleProcess AUTOTUNE benchmarking takes 0.2924 seconds and 1.4096 seconds precompiling for 6 choices 2025-12-04T09:14:59.9693461Z Autotune Choices Stats: 2025-12-04T09:14:59.9694924Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.9695358Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9695688Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9696246Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9697417Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9698579Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9699732Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9700929Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9702141Z triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9703330Z triton_flex_attention_backward_316 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9704478Z triton_flex_attention_backward_317 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9705635Z triton_flex_attention_backward_314 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9706795Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9707939Z triton_flex_attention_backward_318 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9708194Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.3225 seconds precompiling for 13 choices 2025-12-04T09:14:59.9708323Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.9708398Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.9708464Z unimplemented [] 2025-12-04T09:14:59.9708567Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.9708809Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.9710065Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.9710134Z graph_break [] 2025-12-04T09:14:59.9710263Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.9710330Z Autotune Choices Stats: 2025-12-04T09:14:59.9711766Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.007968000136315823, "best_triton_pos": 0} 2025-12-04T09:14:59.9712014Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9712240Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9712557Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9713693Z triton_flex_attention_327 0.0080 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9714814Z triton_flex_attention_328 0.0082 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9715941Z triton_flex_attention_325 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.9717064Z triton_flex_attention_326 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.9718218Z triton_flex_attention_323 0.0113 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9719413Z triton_flex_attention_324 0.0133 ms 59.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9719654Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T09:14:59.9719724Z Autotune Choices Stats: 2025-12-04T09:14:59.9721187Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.9721620Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9721955Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9722512Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9723678Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9724836Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9726021Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9727245Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9728396Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9729594Z triton_flex_attention_backward_336 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9730744Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9731894Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9733129Z triton_flex_attention_backward_338 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9734279Z triton_flex_attention_backward_341 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.9734611Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3668 seconds precompiling for 13 choices 2025-12-04T09:14:59.9734803Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.9734872Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.9734939Z unimplemented [] 2025-12-04T09:14:59.9735042Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.9735245Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.9736422Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.9736488Z graph_break [] 2025-12-04T09:14:59.9736615Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.9736682Z Autotune Choices Stats: 2025-12-04T09:14:59.9738122Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.9738364Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9738583Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9738901Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9740036Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9741152Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9742269Z triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9743415Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.9744614Z triton_flex_attention_345 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.9745766Z triton_flex_attention_343 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9746011Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2820 seconds precompiling for 6 choices 2025-12-04T09:14:59.9746079Z Autotune Choices Stats: 2025-12-04T09:14:59.9747506Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_349", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.9747941Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9748270Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9748826Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9749995Z triton_flex_attention_backward_349 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9751180Z triton_flex_attention_backward_348 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9752363Z triton_flex_attention_backward_350 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9753543Z triton_flex_attention_backward_351 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9754724Z triton_flex_attention_backward_353 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9755882Z triton_flex_attention_backward_355 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9757046Z triton_flex_attention_backward_352 0.0143 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9758199Z triton_flex_attention_backward_354 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9759354Z triton_flex_attention_backward_357 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9760537Z triton_flex_attention_backward_360 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.9760845Z SingleProcess AUTOTUNE benchmarking takes 0.6637 seconds and 2.2959 seconds precompiling for 13 choices 2025-12-04T09:14:59.9760974Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.9761043Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.9761114Z unimplemented [] 2025-12-04T09:14:59.9761214Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.9761402Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.9762815Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.9762892Z graph_break [] 2025-12-04T09:14:59.9763027Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.9763099Z Autotune Choices Stats: 2025-12-04T09:14:59.9764499Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.9764748Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9764968Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9765279Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9766415Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9767531Z triton_flex_attention_366 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9768704Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.9769931Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9771055Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.9772207Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9772510Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3270 seconds precompiling for 6 choices 2025-12-04T09:14:59.9772579Z Autotune Choices Stats: 2025-12-04T09:14:59.9774018Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_368", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.9774452Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9774787Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9775350Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9776515Z triton_flex_attention_backward_368 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9777700Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9778915Z triton_flex_attention_backward_369 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9780102Z triton_flex_attention_backward_370 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9781247Z triton_flex_attention_backward_372 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9782405Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9783551Z triton_flex_attention_backward_374 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9784709Z triton_flex_attention_backward_371 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9785893Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9787101Z triton_flex_attention_backward_379 0.0146 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.9787347Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.4637 seconds precompiling for 13 choices 2025-12-04T09:14:59.9787485Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.9787556Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.9787625Z unimplemented [] 2025-12-04T09:14:59.9787730Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.9787920Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.9789139Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.9789201Z graph_break [] 2025-12-04T09:14:59.9789332Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.9789399Z Autotune Choices Stats: 2025-12-04T09:14:59.9790790Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_384", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.9791033Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9791253Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9791566Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9792699Z triton_flex_attention_384 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9793844Z triton_flex_attention_385 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9795022Z triton_flex_attention_380 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9796151Z triton_flex_attention_382 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.9797299Z triton_flex_attention_383 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.9798404Z triton_flex_attention_381 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9798653Z SingleProcess AUTOTUNE benchmarking takes 0.2918 seconds and 1.3242 seconds precompiling for 6 choices 2025-12-04T09:14:59.9798720Z Autotune Choices Stats: 2025-12-04T09:14:59.9800156Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_386", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:59.9800589Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9800920Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9801474Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9802675Z triton_flex_attention_backward_386 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9803905Z triton_flex_attention_backward_387 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9805089Z triton_flex_attention_backward_388 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9806245Z triton_flex_attention_backward_389 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9807392Z triton_flex_attention_backward_391 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9808550Z triton_flex_attention_backward_392 0.0134 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9809702Z triton_flex_attention_backward_390 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9810883Z triton_flex_attention_backward_393 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9812045Z triton_flex_attention_backward_395 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9813322Z triton_flex_attention_backward_394 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9813574Z SingleProcess AUTOTUNE benchmarking takes 0.6645 seconds and 2.2524 seconds precompiling for 13 choices 2025-12-04T09:14:59.9813736Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.9813807Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.9813875Z unimplemented [] 2025-12-04T09:14:59.9813977Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.9814159Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.9815334Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.9815396Z graph_break [] 2025-12-04T09:14:59.9815528Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.9815594Z Autotune Choices Stats: 2025-12-04T09:14:59.9816998Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_403", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.9817242Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9817468Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9817782Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9818914Z triton_flex_attention_403 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9820067Z triton_flex_attention_404 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9821245Z triton_flex_attention_399 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9822387Z triton_flex_attention_401 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.9823504Z triton_flex_attention_402 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.9824616Z triton_flex_attention_400 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9824866Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3374 seconds precompiling for 6 choices 2025-12-04T09:14:59.9824932Z Autotune Choices Stats: 2025-12-04T09:14:59.9826351Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_406", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.9826783Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9827112Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9827701Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9828927Z triton_flex_attention_backward_406 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9830079Z triton_flex_attention_backward_405 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9831266Z triton_flex_attention_backward_407 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9832419Z triton_flex_attention_backward_408 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9833568Z triton_flex_attention_backward_410 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9834721Z triton_flex_attention_backward_409 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9835878Z triton_flex_attention_backward_411 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9837082Z triton_flex_attention_backward_412 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9838297Z triton_flex_attention_backward_413 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9839481Z triton_flex_attention_backward_414 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9839734Z SingleProcess AUTOTUNE benchmarking takes 0.6644 seconds and 2.4031 seconds precompiling for 13 choices 2025-12-04T09:14:59.9839863Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.9839932Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.9840009Z unimplemented [] 2025-12-04T09:14:59.9840116Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.9840304Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.9841484Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.9841546Z graph_break [] 2025-12-04T09:14:59.9841678Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.9841745Z Autotune Choices Stats: 2025-12-04T09:14:59.9843130Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_422", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.9843374Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9843596Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9843908Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9845073Z triton_flex_attention_422 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9846255Z triton_flex_attention_423 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9847404Z triton_flex_attention_420 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.9848517Z triton_flex_attention_421 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.9849635Z triton_flex_attention_418 0.0123 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9850749Z triton_flex_attention_419 0.0142 ms 57.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9850996Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3073 seconds precompiling for 6 choices 2025-12-04T09:14:59.9851074Z Autotune Choices Stats: 2025-12-04T09:14:59.9852577Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_426", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012256000190973282, "best_triton_pos": 0} 2025-12-04T09:14:59.9853010Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9853374Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9853992Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9855154Z triton_flex_attention_backward_426 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9856335Z triton_flex_attention_backward_424 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9857491Z triton_flex_attention_backward_425 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9858643Z triton_flex_attention_backward_427 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9859800Z triton_flex_attention_backward_429 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9860952Z triton_flex_attention_backward_431 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9862328Z triton_flex_attention_backward_430 0.0141 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9863599Z triton_flex_attention_backward_428 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9864877Z triton_flex_attention_backward_433 0.0143 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9866039Z triton_flex_attention_backward_436 0.0144 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.9866290Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.3392 seconds precompiling for 13 choices 2025-12-04T09:14:59.9866421Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.9866493Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.9866559Z unimplemented [] 2025-12-04T09:14:59.9866660Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.9866861Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.9868040Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.9868102Z graph_break [] 2025-12-04T09:14:59.9868237Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.9868306Z Autotune Choices Stats: 2025-12-04T09:14:59.9869703Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_441", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.9869945Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9870203Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9870549Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9871709Z triton_flex_attention_441 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9872840Z triton_flex_attention_442 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9873993Z triton_flex_attention_439 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.9875105Z triton_flex_attention_437 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9876229Z triton_flex_attention_440 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.9877341Z triton_flex_attention_438 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9877593Z SingleProcess AUTOTUNE benchmarking takes 0.2906 seconds and 1.4377 seconds precompiling for 6 choices 2025-12-04T09:14:59.9877665Z Autotune Choices Stats: 2025-12-04T09:14:59.9879123Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_445", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.9879623Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9879952Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9880509Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9881705Z triton_flex_attention_backward_445 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9882862Z triton_flex_attention_backward_446 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9884014Z triton_flex_attention_backward_443 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9885160Z triton_flex_attention_backward_444 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9886307Z triton_flex_attention_backward_447 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9887491Z triton_flex_attention_backward_448 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9888698Z triton_flex_attention_backward_450 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9889855Z triton_flex_attention_backward_449 0.0135 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9891032Z triton_flex_attention_backward_452 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9892189Z triton_flex_attention_backward_455 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.9892515Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.3278 seconds precompiling for 13 choices 2025-12-04T09:14:59.9892648Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.9892719Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.9892786Z unimplemented [] 2025-12-04T09:14:59.9892887Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.9893072Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.9894248Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.9894314Z graph_break [] 2025-12-04T09:14:59.9894446Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.9894514Z Autotune Choices Stats: 2025-12-04T09:14:59.9895954Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_460", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:14:59.9896275Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9896500Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9896816Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9897939Z triton_flex_attention_460 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9899092Z triton_flex_attention_461 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9900213Z triton_flex_attention_459 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.9901330Z triton_flex_attention_456 0.0114 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9902448Z triton_flex_attention_458 0.0123 ms 75.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.9903563Z triton_flex_attention_457 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9903826Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.4513 seconds precompiling for 6 choices 2025-12-04T09:14:59.9903893Z Autotune Choices Stats: 2025-12-04T09:14:59.9905348Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_463", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.9905840Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9906170Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9906732Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9907927Z triton_flex_attention_backward_463 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9909080Z triton_flex_attention_backward_465 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9910232Z triton_flex_attention_backward_462 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9911382Z triton_flex_attention_backward_464 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9912531Z triton_flex_attention_backward_467 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9913710Z triton_flex_attention_backward_468 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9914915Z triton_flex_attention_backward_469 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9916100Z triton_flex_attention_backward_466 0.0133 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9917270Z triton_flex_attention_backward_471 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9918427Z triton_flex_attention_backward_470 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9918680Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.2662 seconds precompiling for 13 choices 2025-12-04T09:14:59.9918806Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.9918873Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.9918938Z unimplemented [] 2025-12-04T09:14:59.9919038Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.9919230Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.9920420Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.9920481Z graph_break [] 2025-12-04T09:14:59.9920613Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.9920678Z Autotune Choices Stats: 2025-12-04T09:14:59.9922093Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_479", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.007968000136315823, "best_triton_pos": 0} 2025-12-04T09:14:59.9922394Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9922611Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9922932Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9924086Z triton_flex_attention_479 0.0080 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9925203Z triton_flex_attention_480 0.0082 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9926328Z triton_flex_attention_477 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.9927444Z triton_flex_attention_478 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.9928568Z triton_flex_attention_475 0.0113 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9929730Z triton_flex_attention_476 0.0133 ms 59.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9930027Z SingleProcess AUTOTUNE benchmarking takes 0.2917 seconds and 1.3266 seconds precompiling for 6 choices 2025-12-04T09:14:59.9930123Z Autotune Choices Stats: 2025-12-04T09:14:59.9931550Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_481", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:14:59.9931981Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9932411Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9932987Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9934163Z triton_flex_attention_backward_481 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9935317Z triton_flex_attention_backward_482 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9936472Z triton_flex_attention_backward_483 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9937620Z triton_flex_attention_backward_484 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9938802Z triton_flex_attention_backward_485 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9940018Z triton_flex_attention_backward_486 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9941199Z triton_flex_attention_backward_487 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9942352Z triton_flex_attention_backward_488 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9943496Z triton_flex_attention_backward_490 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9944648Z triton_flex_attention_backward_493 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.9944894Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2475 seconds precompiling for 13 choices 2025-12-04T09:14:59.9945033Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.9945103Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.9945172Z unimplemented [] 2025-12-04T09:14:59.9945276Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.9945460Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.9946674Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.9946766Z graph_break [] 2025-12-04T09:14:59.9946928Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.9946995Z Autotune Choices Stats: 2025-12-04T09:14:59.9948384Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_498", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.9948623Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9948839Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9949191Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9950315Z triton_flex_attention_498 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9951428Z triton_flex_attention_499 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9952544Z triton_flex_attention_494 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9953650Z triton_flex_attention_496 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.9954771Z triton_flex_attention_497 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.9955910Z triton_flex_attention_495 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9956227Z SingleProcess AUTOTUNE benchmarking takes 0.2916 seconds and 1.3152 seconds precompiling for 6 choices 2025-12-04T09:14:59.9956295Z Autotune Choices Stats: 2025-12-04T09:14:59.9957721Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_500", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:59.9958195Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9958529Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9959088Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9960257Z triton_flex_attention_backward_500 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9961408Z triton_flex_attention_backward_501 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9962743Z triton_flex_attention_backward_502 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9963973Z triton_flex_attention_backward_503 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9965246Z triton_flex_attention_backward_505 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9966415Z triton_flex_attention_backward_507 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9967619Z triton_flex_attention_backward_506 0.0143 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9968791Z triton_flex_attention_backward_504 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9969957Z triton_flex_attention_backward_509 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9971117Z triton_flex_attention_backward_512 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.9971377Z SingleProcess AUTOTUNE benchmarking takes 0.6641 seconds and 2.2733 seconds precompiling for 13 choices 2025-12-04T09:14:59.9971522Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.9971594Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.9971666Z unimplemented [] 2025-12-04T09:14:59.9971771Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.9971953Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.9973206Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 27), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 8), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 5), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.9973328Z graph_break [] 2025-12-04T09:14:59.9973464Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.9973531Z Autotune Choices Stats: 2025-12-04T09:14:59.9974923Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_517", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:14:59.9975212Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9975433Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9975755Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9976885Z triton_flex_attention_517 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9978004Z triton_flex_attention_518 0.0084 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9979124Z triton_flex_attention_513 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9980244Z triton_flex_attention_515 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:14:59.9981396Z triton_flex_attention_516 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:14:59.9982576Z triton_flex_attention_514 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9982825Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3311 seconds precompiling for 6 choices 2025-12-04T09:14:59.9982891Z Autotune Choices Stats: 2025-12-04T09:14:59.9984359Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_519", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:14:59.9984798Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:14:59.9985140Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:14:59.9985707Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:14:59.9986881Z triton_flex_attention_backward_519 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9988043Z triton_flex_attention_backward_520 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9989233Z triton_flex_attention_backward_521 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9990385Z triton_flex_attention_backward_522 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9991596Z triton_flex_attention_backward_524 0.0132 ms 93.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9992802Z triton_flex_attention_backward_523 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:14:59.9993950Z triton_flex_attention_backward_525 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:14:59.9995101Z triton_flex_attention_backward_526 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:14:59.9996255Z triton_flex_attention_backward_528 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:14:59.9997410Z triton_flex_attention_backward_531 0.0153 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:14:59.9997654Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.2909 seconds precompiling for 13 choices 2025-12-04T09:14:59.9997787Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:14:59.9997893Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:14:59.9998001Z unimplemented [] 2025-12-04T09:14:59.9998189Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:14:59.9998375Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:14:59.9999569Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:14:59.9999628Z graph_break [] 2025-12-04T09:14:59.9999766Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:14:59.9999831Z Autotune Choices Stats: 2025-12-04T09:15:00.0001257Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_536", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.0001507Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0001725Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0002046Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0003186Z triton_flex_attention_536 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0004310Z triton_flex_attention_537 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0005425Z triton_flex_attention_532 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0006573Z triton_flex_attention_534 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.0007749Z triton_flex_attention_535 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.0008857Z triton_flex_attention_533 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0009140Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.2928 seconds precompiling for 6 choices 2025-12-04T09:15:00.0009207Z Autotune Choices Stats: 2025-12-04T09:15:00.0010638Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_538", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:15:00.0011069Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0011401Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0011958Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0013180Z triton_flex_attention_backward_538 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0014346Z triton_flex_attention_backward_539 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0015537Z triton_flex_attention_backward_540 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0016743Z triton_flex_attention_backward_541 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0017926Z triton_flex_attention_backward_542 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0019082Z triton_flex_attention_backward_543 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0020237Z triton_flex_attention_backward_545 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0021404Z triton_flex_attention_backward_544 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0022561Z triton_flex_attention_backward_547 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0023752Z triton_flex_attention_backward_550 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.0024074Z SingleProcess AUTOTUNE benchmarking takes 0.6681 seconds and 2.3456 seconds precompiling for 13 choices 2025-12-04T09:15:00.0024211Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.0024281Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.0024353Z unimplemented [] 2025-12-04T09:15:00.0024455Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.0024637Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.0025818Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.0025881Z graph_break [] 2025-12-04T09:15:00.0026049Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.0026116Z Autotune Choices Stats: 2025-12-04T09:15:00.0027498Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_555", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.0027746Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0027966Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0028296Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0029418Z triton_flex_attention_555 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0030532Z triton_flex_attention_556 0.0112 ms 91.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0031686Z triton_flex_attention_553 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.0032837Z triton_flex_attention_554 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.0033982Z triton_flex_attention_551 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0035127Z triton_flex_attention_552 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0035393Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.3230 seconds precompiling for 6 choices 2025-12-04T09:15:00.0035460Z Autotune Choices Stats: 2025-12-04T09:15:00.0036888Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_559", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.0037323Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0037656Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0038217Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0039400Z triton_flex_attention_backward_559 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0040584Z triton_flex_attention_backward_560 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0041796Z triton_flex_attention_backward_558 0.0142 ms 93.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0042966Z triton_flex_attention_backward_557 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0044115Z triton_flex_attention_backward_562 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0045263Z triton_flex_attention_backward_561 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0046406Z triton_flex_attention_backward_563 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0047553Z triton_flex_attention_backward_564 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0048697Z triton_flex_attention_backward_569 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.0049889Z triton_flex_attention_backward_566 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0050193Z SingleProcess AUTOTUNE benchmarking takes 0.6659 seconds and 2.2862 seconds precompiling for 13 choices 2025-12-04T09:15:00.0050328Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.0050398Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.0050461Z unimplemented [] 2025-12-04T09:15:00.0050583Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.0050771Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.0051988Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.0052052Z graph_break [] 2025-12-04T09:15:00.0052186Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.0052252Z Autotune Choices Stats: 2025-12-04T09:15:00.0053692Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_575", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.0053941Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0054155Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0054476Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0055595Z triton_flex_attention_575 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0056715Z triton_flex_attention_574 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0057865Z triton_flex_attention_572 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.0059057Z triton_flex_attention_573 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.0060199Z triton_flex_attention_570 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0061307Z triton_flex_attention_571 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0061561Z SingleProcess AUTOTUNE benchmarking takes 0.2930 seconds and 1.2848 seconds precompiling for 6 choices 2025-12-04T09:15:00.0061635Z Autotune Choices Stats: 2025-12-04T09:15:00.0063206Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_579", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.014271999709308147, "best_triton_pos": 0} 2025-12-04T09:15:00.0063638Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0063968Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0064529Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0065771Z triton_flex_attention_backward_579 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0067029Z triton_flex_attention_backward_577 0.0143 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0068192Z triton_flex_attention_backward_578 0.0143 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0069393Z triton_flex_attention_backward_576 0.0145 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0070542Z triton_flex_attention_backward_580 0.0154 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0071686Z triton_flex_attention_backward_581 0.0154 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0072834Z triton_flex_attention_backward_583 0.0155 ms 92.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0073999Z triton_flex_attention_backward_582 0.0164 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0075185Z triton_flex_attention_backward_585 0.0164 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0076395Z triton_flex_attention_backward_588 0.0164 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.0076641Z SingleProcess AUTOTUNE benchmarking takes 0.6683 seconds and 2.2637 seconds precompiling for 13 choices 2025-12-04T09:15:00.0076778Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.0076851Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.0076915Z unimplemented [] 2025-12-04T09:15:00.0077056Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.0077244Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.0078421Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.0078482Z graph_break [] 2025-12-04T09:15:00.0078614Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.0078686Z Autotune Choices Stats: 2025-12-04T09:15:00.0080082Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.0080332Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0080552Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0080878Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0081997Z triton_flex_attention_593 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0083155Z triton_flex_attention_594 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0084331Z triton_flex_attention_589 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0085482Z triton_flex_attention_591 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.0086603Z triton_flex_attention_592 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.0087717Z triton_flex_attention_590 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0087968Z SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.3283 seconds precompiling for 6 choices 2025-12-04T09:15:00.0088035Z Autotune Choices Stats: 2025-12-04T09:15:00.0089472Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_595", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:15:00.0089911Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0090233Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0090800Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0092018Z triton_flex_attention_backward_595 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0093294Z triton_flex_attention_backward_596 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0094485Z triton_flex_attention_backward_597 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0095637Z triton_flex_attention_backward_598 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0096798Z triton_flex_attention_backward_600 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0097945Z triton_flex_attention_backward_604 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0099097Z triton_flex_attention_backward_599 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0100280Z triton_flex_attention_backward_601 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0101491Z triton_flex_attention_backward_602 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0102642Z triton_flex_attention_backward_607 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.0102918Z SingleProcess AUTOTUNE benchmarking takes 0.6683 seconds and 2.3815 seconds precompiling for 13 choices 2025-12-04T09:15:00.0103062Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.0103133Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.0103195Z unimplemented [] 2025-12-04T09:15:00.0103303Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.0103484Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.0104664Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.0104726Z graph_break [] 2025-12-04T09:15:00.0104851Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.0104925Z Autotune Choices Stats: 2025-12-04T09:15:00.0106312Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_612", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.0106566Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0106781Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0107098Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0108258Z triton_flex_attention_612 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0109428Z triton_flex_attention_613 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0110545Z triton_flex_attention_610 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.0111690Z triton_flex_attention_608 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0112806Z triton_flex_attention_611 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.0113915Z triton_flex_attention_609 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0114175Z SingleProcess AUTOTUNE benchmarking takes 0.2948 seconds and 1.3336 seconds precompiling for 6 choices 2025-12-04T09:15:00.0114243Z Autotune Choices Stats: 2025-12-04T09:15:00.0115675Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_614", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:15:00.0116110Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0116466Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0117061Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0118261Z triton_flex_attention_backward_614 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0119474Z triton_flex_attention_backward_615 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0120629Z triton_flex_attention_backward_616 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0121778Z triton_flex_attention_backward_617 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0122933Z triton_flex_attention_backward_619 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0124089Z triton_flex_attention_backward_620 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0125416Z triton_flex_attention_backward_621 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0126607Z triton_flex_attention_backward_618 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0127787Z triton_flex_attention_backward_623 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0128972Z triton_flex_attention_backward_626 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.0129223Z SingleProcess AUTOTUNE benchmarking takes 0.6671 seconds and 2.3094 seconds precompiling for 13 choices 2025-12-04T09:15:00.0129356Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.0129424Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.0129489Z unimplemented [] 2025-12-04T09:15:00.0129601Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.0129782Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.0130979Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.0131040Z graph_break [] 2025-12-04T09:15:00.0131167Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.0131239Z Autotune Choices Stats: 2025-12-04T09:15:00.0132667Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_631", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.0132919Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0133135Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0133493Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0134633Z triton_flex_attention_631 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0135861Z triton_flex_attention_632 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0137025Z triton_flex_attention_627 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0138144Z triton_flex_attention_629 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.0139255Z triton_flex_attention_630 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.0140382Z triton_flex_attention_628 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0140636Z SingleProcess AUTOTUNE benchmarking takes 0.2939 seconds and 1.3035 seconds precompiling for 6 choices 2025-12-04T09:15:00.0140705Z Autotune Choices Stats: 2025-12-04T09:15:00.0142134Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_634", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.0142603Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0142990Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0143554Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0144712Z triton_flex_attention_backward_634 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0145917Z triton_flex_attention_backward_635 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0147073Z triton_flex_attention_backward_636 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0148229Z triton_flex_attention_backward_633 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0149406Z triton_flex_attention_backward_638 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0150552Z triton_flex_attention_backward_640 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0151741Z triton_flex_attention_backward_637 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0152966Z triton_flex_attention_backward_639 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0154147Z triton_flex_attention_backward_642 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0155300Z triton_flex_attention_backward_645 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.0155542Z SingleProcess AUTOTUNE benchmarking takes 0.6697 seconds and 2.2461 seconds precompiling for 13 choices 2025-12-04T09:15:00.0155683Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.0155753Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.0155816Z unimplemented [] 2025-12-04T09:15:00.0155922Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.0156103Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.0157283Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.0157347Z graph_break [] 2025-12-04T09:15:00.0157474Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.0157546Z Autotune Choices Stats: 2025-12-04T09:15:00.0158940Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_650", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.0159223Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0159471Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0159826Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0160951Z triton_flex_attention_650 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0162294Z triton_flex_attention_651 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0163440Z triton_flex_attention_648 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.0164567Z triton_flex_attention_646 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0165682Z triton_flex_attention_649 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.0166800Z triton_flex_attention_647 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0167047Z SingleProcess AUTOTUNE benchmarking takes 0.2938 seconds and 1.3188 seconds precompiling for 6 choices 2025-12-04T09:15:00.0167115Z Autotune Choices Stats: 2025-12-04T09:15:00.0168602Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_654", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.0169148Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0169476Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0170039Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0171253Z triton_flex_attention_backward_654 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0172474Z triton_flex_attention_backward_652 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0173652Z triton_flex_attention_backward_653 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0174797Z triton_flex_attention_backward_655 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0175956Z triton_flex_attention_backward_657 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0177133Z triton_flex_attention_backward_659 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0178344Z triton_flex_attention_backward_658 0.0156 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0179533Z triton_flex_attention_backward_656 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0180683Z triton_flex_attention_backward_661 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0181837Z triton_flex_attention_backward_664 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.0182085Z SingleProcess AUTOTUNE benchmarking takes 0.6665 seconds and 2.3325 seconds precompiling for 13 choices 2025-12-04T09:15:00.0182222Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.0182290Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.0182352Z unimplemented [] 2025-12-04T09:15:00.0182459Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.0182641Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.0183838Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.0183904Z graph_break [] 2025-12-04T09:15:00.0184032Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.0184104Z Autotune Choices Stats: 2025-12-04T09:15:00.0185532Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_669", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.0185860Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0186077Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0186398Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0187525Z triton_flex_attention_669 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0188679Z triton_flex_attention_670 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0189792Z triton_flex_attention_667 0.0123 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.0190906Z triton_flex_attention_665 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0192019Z triton_flex_attention_668 0.0135 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.0193134Z triton_flex_attention_666 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0193378Z SingleProcess AUTOTUNE benchmarking takes 0.2949 seconds and 1.4095 seconds precompiling for 6 choices 2025-12-04T09:15:00.0193485Z Autotune Choices Stats: 2025-12-04T09:15:00.0194916Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_673", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012512000277638435, "best_triton_pos": 0} 2025-12-04T09:15:00.0195408Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0195732Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0196330Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0197492Z triton_flex_attention_backward_673 0.0125 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0198645Z triton_flex_attention_backward_672 0.0133 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0199804Z triton_flex_attention_backward_674 0.0133 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0200953Z triton_flex_attention_backward_671 0.0143 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0202138Z triton_flex_attention_backward_676 0.0143 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0203356Z triton_flex_attention_backward_675 0.0152 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0204512Z triton_flex_attention_backward_677 0.0154 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0205697Z triton_flex_attention_backward_678 0.0154 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0206845Z triton_flex_attention_backward_680 0.0164 ms 76.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0208003Z triton_flex_attention_backward_683 0.0164 ms 76.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.0208247Z SingleProcess AUTOTUNE benchmarking takes 0.6665 seconds and 2.3098 seconds precompiling for 13 choices 2025-12-04T09:15:00.0208381Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.0208448Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.0208513Z unimplemented [] 2025-12-04T09:15:00.0208620Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.0208802Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.0209984Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.0210045Z graph_break [] 2025-12-04T09:15:00.0210170Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.0210277Z Autotune Choices Stats: 2025-12-04T09:15:00.0211698Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_688", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.0211974Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0212192Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0212568Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0213732Z triton_flex_attention_688 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0214846Z triton_flex_attention_689 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0215960Z triton_flex_attention_686 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.0217073Z triton_flex_attention_684 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0218182Z triton_flex_attention_687 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.0219344Z triton_flex_attention_685 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0219645Z SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.3382 seconds precompiling for 6 choices 2025-12-04T09:15:00.0219716Z Autotune Choices Stats: 2025-12-04T09:15:00.0221140Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_693", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.0221586Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0221950Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0222510Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0223672Z triton_flex_attention_backward_693 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0224828Z triton_flex_attention_backward_692 0.0135 ms 98.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0225978Z triton_flex_attention_backward_691 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0227127Z triton_flex_attention_backward_695 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0228308Z triton_flex_attention_backward_690 0.0153 ms 86.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0229521Z triton_flex_attention_backward_696 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0230710Z triton_flex_attention_backward_697 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0231858Z triton_flex_attention_backward_699 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0233017Z triton_flex_attention_backward_694 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0234163Z triton_flex_attention_backward_702 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.0234408Z SingleProcess AUTOTUNE benchmarking takes 0.6670 seconds and 2.3646 seconds precompiling for 13 choices 2025-12-04T09:15:00.0234542Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.0234610Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.0234672Z unimplemented [] 2025-12-04T09:15:00.0234777Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.0234960Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.0236159Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.0236285Z graph_break [] 2025-12-04T09:15:00.0236413Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.0236484Z Autotune Choices Stats: 2025-12-04T09:15:00.0237862Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_707", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.0238120Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0238372Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0238692Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0239821Z triton_flex_attention_707 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0240933Z triton_flex_attention_708 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0242045Z triton_flex_attention_703 0.0134 ms 76.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0243163Z triton_flex_attention_705 0.0135 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.0244303Z triton_flex_attention_706 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.0245478Z triton_flex_attention_704 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0245721Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3640 seconds precompiling for 6 choices 2025-12-04T09:15:00.0245793Z Autotune Choices Stats: 2025-12-04T09:15:00.0247260Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_712", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.0247696Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0248023Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0248589Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0249775Z triton_flex_attention_backward_712 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0250934Z triton_flex_attention_backward_710 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0252089Z triton_flex_attention_backward_711 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0253332Z triton_flex_attention_backward_709 0.0153 ms 86.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0254573Z triton_flex_attention_backward_714 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0255749Z triton_flex_attention_backward_716 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0256899Z triton_flex_attention_backward_713 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0258044Z triton_flex_attention_backward_715 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0259198Z triton_flex_attention_backward_718 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0260351Z triton_flex_attention_backward_721 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.0260597Z SingleProcess AUTOTUNE benchmarking takes 0.6684 seconds and 2.3297 seconds precompiling for 13 choices 2025-12-04T09:15:00.0260733Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.0260803Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.0260866Z unimplemented [] 2025-12-04T09:15:00.0260975Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.0261191Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.0262588Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.0262658Z graph_break [] 2025-12-04T09:15:00.0262793Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.0262867Z Autotune Choices Stats: 2025-12-04T09:15:00.0264327Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_726", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.0264587Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0264805Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0265127Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0266282Z triton_flex_attention_726 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0267410Z triton_flex_attention_727 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0268529Z triton_flex_attention_724 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.0269647Z triton_flex_attention_722 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0270854Z triton_flex_attention_725 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.0272054Z triton_flex_attention_723 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0272299Z SingleProcess AUTOTUNE benchmarking takes 0.2946 seconds and 1.3261 seconds precompiling for 6 choices 2025-12-04T09:15:00.0272377Z Autotune Choices Stats: 2025-12-04T09:15:00.0273840Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_730", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.0274288Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0274615Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0275188Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0276352Z triton_flex_attention_backward_730 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0277514Z triton_flex_attention_backward_729 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0278698Z triton_flex_attention_backward_731 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0279940Z triton_flex_attention_backward_728 0.0152 ms 87.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0281098Z triton_flex_attention_backward_733 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0282277Z triton_flex_attention_backward_732 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0283430Z triton_flex_attention_backward_734 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0284581Z triton_flex_attention_backward_735 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0285735Z triton_flex_attention_backward_737 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0286885Z triton_flex_attention_backward_736 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0287164Z SingleProcess AUTOTUNE benchmarking takes 0.6686 seconds and 2.3200 seconds precompiling for 13 choices 2025-12-04T09:15:00.0287330Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.0287448Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.0287513Z unimplemented [] 2025-12-04T09:15:00.0287626Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.0287809Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.0288987Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.0289053Z graph_break [] 2025-12-04T09:15:00.0289188Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.0289263Z Autotune Choices Stats: 2025-12-04T09:15:00.0290681Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_745", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.0290927Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0291149Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0291466Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0292639Z triton_flex_attention_745 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0293768Z triton_flex_attention_746 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0294885Z triton_flex_attention_743 0.0142 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.0296036Z triton_flex_attention_741 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0297211Z triton_flex_attention_744 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.0298360Z triton_flex_attention_742 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0298606Z SingleProcess AUTOTUNE benchmarking takes 0.2945 seconds and 1.2852 seconds precompiling for 6 choices 2025-12-04T09:15:00.0298678Z Autotune Choices Stats: 2025-12-04T09:15:00.0300100Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_748", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.0300535Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0300859Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0301428Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0302593Z triton_flex_attention_backward_748 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0303794Z triton_flex_attention_backward_749 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0304949Z triton_flex_attention_backward_747 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0306161Z triton_flex_attention_backward_750 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0307349Z triton_flex_attention_backward_752 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0308498Z triton_flex_attention_backward_751 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0309662Z triton_flex_attention_backward_753 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0310809Z triton_flex_attention_backward_754 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0311963Z triton_flex_attention_backward_756 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0313154Z triton_flex_attention_backward_759 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.0313482Z SingleProcess AUTOTUNE benchmarking takes 0.6747 seconds and 2.3463 seconds precompiling for 13 choices 2025-12-04T09:15:00.0313611Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.0313687Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.0313749Z unimplemented [] 2025-12-04T09:15:00.0313855Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.0314038Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.0315251Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.0315321Z graph_break [] 2025-12-04T09:15:00.0315446Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.0315518Z Autotune Choices Stats: 2025-12-04T09:15:00.0316896Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_764", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.0317147Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0317365Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0317681Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0318819Z triton_flex_attention_764 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0319952Z triton_flex_attention_765 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0321099Z triton_flex_attention_762 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.0322273Z triton_flex_attention_760 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0323384Z triton_flex_attention_763 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.0324530Z triton_flex_attention_761 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0324772Z SingleProcess AUTOTUNE benchmarking takes 0.2940 seconds and 1.3826 seconds precompiling for 6 choices 2025-12-04T09:15:00.0324849Z Autotune Choices Stats: 2025-12-04T09:15:00.0326270Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_769", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013376000337302685, "best_triton_pos": 0} 2025-12-04T09:15:00.0326716Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0327045Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0327615Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0328788Z triton_flex_attention_backward_769 0.0134 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0329990Z triton_flex_attention_backward_767 0.0143 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0331192Z triton_flex_attention_backward_768 0.0143 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0332454Z triton_flex_attention_backward_766 0.0154 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0333643Z triton_flex_attention_backward_770 0.0154 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0334794Z triton_flex_attention_backward_771 0.0154 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0335948Z triton_flex_attention_backward_772 0.0154 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0337095Z triton_flex_attention_backward_773 0.0154 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0338280Z triton_flex_attention_backward_775 0.0164 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0339492Z triton_flex_attention_backward_778 0.0164 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.0339740Z SingleProcess AUTOTUNE benchmarking takes 0.6673 seconds and 2.4135 seconds precompiling for 13 choices 2025-12-04T09:15:00.0339867Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.0339944Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.0340008Z unimplemented [] 2025-12-04T09:15:00.0340111Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.0340301Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.0341510Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.0341579Z graph_break [] 2025-12-04T09:15:00.0341705Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.0341776Z Autotune Choices Stats: 2025-12-04T09:15:00.0343166Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_783", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.0343413Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0343630Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0343946Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0345080Z triton_flex_attention_783 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0346251Z triton_flex_attention_784 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0347406Z triton_flex_attention_781 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.0348575Z triton_flex_attention_779 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0349733Z triton_flex_attention_782 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.0350852Z triton_flex_attention_780 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0351097Z SingleProcess AUTOTUNE benchmarking takes 0.2940 seconds and 1.3371 seconds precompiling for 6 choices 2025-12-04T09:15:00.0351170Z Autotune Choices Stats: 2025-12-04T09:15:00.0352598Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_787", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.0353030Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0353355Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0353921Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0355124Z triton_flex_attention_backward_787 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0356335Z triton_flex_attention_backward_785 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0357514Z triton_flex_attention_backward_786 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0358664Z triton_flex_attention_backward_788 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0359821Z triton_flex_attention_backward_790 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0360970Z triton_flex_attention_backward_791 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0362286Z triton_flex_attention_backward_789 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0363465Z triton_flex_attention_backward_792 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0364676Z triton_flex_attention_backward_794 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0365922Z triton_flex_attention_backward_797 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.0366170Z SingleProcess AUTOTUNE benchmarking takes 0.6666 seconds and 2.3976 seconds precompiling for 13 choices 2025-12-04T09:15:00.0366302Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.0366420Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.0366484Z unimplemented [] 2025-12-04T09:15:00.0366586Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.0366773Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.0367943Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.0368011Z graph_break [] 2025-12-04T09:15:00.0368138Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.0368210Z Autotune Choices Stats: 2025-12-04T09:15:00.0369588Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_803", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.0369835Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0370053Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0370369Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0371498Z triton_flex_attention_803 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0372706Z triton_flex_attention_802 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0373882Z triton_flex_attention_801 0.0143 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.0375048Z triton_flex_attention_798 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0376159Z triton_flex_attention_800 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.0377273Z triton_flex_attention_799 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0377520Z SingleProcess AUTOTUNE benchmarking takes 0.2967 seconds and 1.3426 seconds precompiling for 6 choices 2025-12-04T09:15:00.0377593Z Autotune Choices Stats: 2025-12-04T09:15:00.0379016Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_806", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.0379461Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0379781Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0380382Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0381607Z triton_flex_attention_backward_806 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0382760Z triton_flex_attention_backward_807 0.0142 ms 93.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0383936Z triton_flex_attention_backward_804 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0385102Z triton_flex_attention_backward_805 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0386252Z triton_flex_attention_backward_809 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0387400Z triton_flex_attention_backward_810 0.0163 ms 81.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0388549Z triton_flex_attention_backward_813 0.0163 ms 81.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0389726Z triton_flex_attention_backward_808 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0390940Z triton_flex_attention_backward_811 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0392124Z triton_flex_attention_backward_816 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.0392381Z SingleProcess AUTOTUNE benchmarking takes 0.6691 seconds and 2.4016 seconds precompiling for 13 choices 2025-12-04T09:15:00.0392552Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:15:00.0392639Z Traceback (most recent call last): 2025-12-04T09:15:00.0392940Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:15:00.0393007Z self.assertTrue( 2025-12-04T09:15:00.0393215Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:15:00.0393299Z raise self.failureException(msg) 2025-12-04T09:15:00.0393552Z AssertionError: False is not true : Log file /tmp/tmpjn0myk4t/flex_attention_configs.json was not created 2025-12-04T09:15:00.0393557Z 2025-12-04T09:15:00.0393692Z To execute this test, run the following from the base repo dir: 2025-12-04T09:15:00.0393946Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:15:00.0393951Z 2025-12-04T09:15:00.0394121Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:15:00.0394257Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.0394334Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.0394401Z unimplemented [] 2025-12-04T09:15:00.0394508Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.0395712Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('async_compile_cache_miss', 23), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:15:00.0395903Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.0395973Z graph_break [] 2025-12-04T09:15:00.0396100Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.0397127Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:15:00.0397275Z current_size = base.storage().size() 2025-12-04T09:15:00.0397342Z Autotune Choices Stats: 2025-12-04T09:15:00.0398742Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.0398986Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0399215Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0399608Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0400740Z triton_flex_attention_4 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0401847Z triton_flex_attention_5 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0402959Z triton_flex_attention_0 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0404070Z triton_flex_attention_2 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.0405197Z triton_flex_attention_3 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.0406335Z triton_flex_attention_1 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0406679Z SingleProcess AUTOTUNE benchmarking takes 0.2600 seconds and 1.4244 seconds precompiling for 6 choices 2025-12-04T09:15:00.0406747Z Autotune Choices Stats: 2025-12-04T09:15:00.0408177Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.0408642Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0408975Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0409546Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0410720Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0411870Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0413073Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0414257Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0415479Z triton_flex_attention_backward_11 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0416631Z triton_flex_attention_backward_12 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0417805Z triton_flex_attention_backward_13 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0418955Z triton_flex_attention_backward_10 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0420107Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0421252Z triton_flex_attention_backward_15 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0421503Z SingleProcess AUTOTUNE benchmarking takes 0.8717 seconds and 2.2315 seconds precompiling for 13 choices 2025-12-04T09:15:00.0421632Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.0421701Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.0421770Z unimplemented [] 2025-12-04T09:15:00.0421873Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.0422063Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.0423268Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.0423387Z graph_break [] 2025-12-04T09:15:00.0423523Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.0423590Z Autotune Choices Stats: 2025-12-04T09:15:00.0424983Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.0425258Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0425480Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0425798Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0426936Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0428039Z triton_flex_attention_24 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0429146Z triton_flex_attention_19 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0430255Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.0431396Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.0432564Z triton_flex_attention_20 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0432813Z SingleProcess AUTOTUNE benchmarking takes 0.2898 seconds and 1.2778 seconds precompiling for 6 choices 2025-12-04T09:15:00.0432882Z Autotune Choices Stats: 2025-12-04T09:15:00.0434341Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.0434777Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0435111Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0435676Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0436846Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0437999Z triton_flex_attention_backward_27 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0439150Z triton_flex_attention_backward_28 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0440344Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0441566Z triton_flex_attention_backward_30 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0442742Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0443893Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0445047Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0446194Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0447337Z triton_flex_attention_backward_37 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.0447584Z SingleProcess AUTOTUNE benchmarking takes 0.6605 seconds and 2.3199 seconds precompiling for 13 choices 2025-12-04T09:15:00.0447712Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.0447781Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.0447861Z unimplemented [] 2025-12-04T09:15:00.0447999Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.0448242Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.0449421Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.0449484Z graph_break [] 2025-12-04T09:15:00.0449618Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.0449684Z Autotune Choices Stats: 2025-12-04T09:15:00.0451103Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.0451352Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0451574Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0451900Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0453079Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0454202Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0455321Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.0456476Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.0457647Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0458757Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0459005Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2616 seconds precompiling for 6 choices 2025-12-04T09:15:00.0459071Z Autotune Choices Stats: 2025-12-04T09:15:00.0460528Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.0460975Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0461309Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0462022Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0463212Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0464365Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0465589Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0466832Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0468032Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0469185Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0470326Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0471480Z triton_flex_attention_backward_50 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0472626Z triton_flex_attention_backward_53 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0473819Z triton_flex_attention_backward_56 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.0474068Z SingleProcess AUTOTUNE benchmarking takes 0.6631 seconds and 2.2635 seconds precompiling for 13 choices 2025-12-04T09:15:00.0474258Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.0474329Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.0474398Z unimplemented [] 2025-12-04T09:15:00.0474503Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.0474685Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.0475882Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.0475945Z graph_break [] 2025-12-04T09:15:00.0476077Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.0476143Z Autotune Choices Stats: 2025-12-04T09:15:00.0477560Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.0477806Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0478028Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0478348Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0479480Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0480594Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0481714Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.0482863Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.0484036Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0485179Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0485430Z SingleProcess AUTOTUNE benchmarking takes 0.2907 seconds and 1.2500 seconds precompiling for 6 choices 2025-12-04T09:15:00.0485499Z Autotune Choices Stats: 2025-12-04T09:15:00.0486929Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.0487360Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0487689Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0488254Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0489422Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0490614Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0491829Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0493076Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0494269Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0495426Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0496574Z triton_flex_attention_backward_69 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0497724Z triton_flex_attention_backward_70 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0498871Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0500064Z triton_flex_attention_backward_75 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.0500376Z SingleProcess AUTOTUNE benchmarking takes 0.6601 seconds and 2.1469 seconds precompiling for 13 choices 2025-12-04T09:15:00.0500505Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.0500574Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.0500642Z unimplemented [] 2025-12-04T09:15:00.0500745Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.0500927Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.0502152Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.0502217Z graph_break [] 2025-12-04T09:15:00.0502353Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.0502419Z Autotune Choices Stats: 2025-12-04T09:15:00.0503810Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.0504054Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0504276Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0504593Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0505717Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0506830Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0507977Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.0512769Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.0513949Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0515071Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0515329Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2576 seconds precompiling for 6 choices 2025-12-04T09:15:00.0515403Z Autotune Choices Stats: 2025-12-04T09:15:00.0516869Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_84", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.0517317Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0517657Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0518225Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0519414Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0520601Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0521821Z triton_flex_attention_backward_83 0.0122 ms 92.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0523007Z triton_flex_attention_backward_82 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0524160Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0525310Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0526447Z triton_flex_attention_backward_88 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0527600Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0528776Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0529985Z triton_flex_attention_backward_91 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0530235Z SingleProcess AUTOTUNE benchmarking takes 0.6599 seconds and 2.2799 seconds precompiling for 13 choices 2025-12-04T09:15:00.0530390Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.0530464Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.0530534Z unimplemented [] 2025-12-04T09:15:00.0530646Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.0530869Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.0532058Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.0532117Z graph_break [] 2025-12-04T09:15:00.0532251Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.0532385Z Autotune Choices Stats: 2025-12-04T09:15:00.0533784Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:15:00.0534036Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0534253Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0534579Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0535706Z triton_flex_attention_99 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0536863Z triton_flex_attention_100 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0538092Z triton_flex_attention_95 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0539204Z triton_flex_attention_98 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.0540348Z triton_flex_attention_97 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.0541451Z triton_flex_attention_96 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0541706Z SingleProcess AUTOTUNE benchmarking takes 0.2905 seconds and 1.2994 seconds precompiling for 6 choices 2025-12-04T09:15:00.0541771Z Autotune Choices Stats: 2025-12-04T09:15:00.0543207Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_102", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.0543639Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0543974Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0544536Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0545734Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0546947Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0548147Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0549300Z triton_flex_attention_backward_101 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0550454Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0551606Z triton_flex_attention_backward_107 0.0133 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0552761Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0553951Z triton_flex_attention_backward_108 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0555126Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0556313Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0556561Z SingleProcess AUTOTUNE benchmarking takes 0.6615 seconds and 2.2434 seconds precompiling for 13 choices 2025-12-04T09:15:00.0556726Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.0556799Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.0556866Z unimplemented [] 2025-12-04T09:15:00.0556967Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.0557152Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.0558332Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.0558394Z graph_break [] 2025-12-04T09:15:00.0558527Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.0558592Z Autotune Choices Stats: 2025-12-04T09:15:00.0559982Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.0560233Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0560452Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0560777Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0562521Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0563694Z triton_flex_attention_119 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0564926Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.0566106Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.0567240Z triton_flex_attention_114 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0568356Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0568610Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2587 seconds precompiling for 6 choices 2025-12-04T09:15:00.0568675Z Autotune Choices Stats: 2025-12-04T09:15:00.0570104Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:15:00.0570538Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0570868Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0571607Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0572909Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0574067Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0575262Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0576417Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0577573Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0578726Z triton_flex_attention_backward_124 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0579868Z triton_flex_attention_backward_126 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0581054Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0582257Z triton_flex_attention_backward_129 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0583437Z triton_flex_attention_backward_132 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.0583691Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.3330 seconds precompiling for 13 choices 2025-12-04T09:15:00.0583829Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.0583898Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.0583961Z unimplemented [] 2025-12-04T09:15:00.0584071Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.0584257Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.0585443Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.0585504Z graph_break [] 2025-12-04T09:15:00.0585638Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.0585707Z Autotune Choices Stats: 2025-12-04T09:15:00.0587088Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:15:00.0587340Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0587556Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0587880Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0589047Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0590230Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0591377Z triton_flex_attention_136 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.0592498Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.0593618Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0594731Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0594979Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3488 seconds precompiling for 6 choices 2025-12-04T09:15:00.0595045Z Autotune Choices Stats: 2025-12-04T09:15:00.0596479Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_140", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.0596910Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0597272Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0597888Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0599048Z triton_flex_attention_backward_140 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0600238Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0601389Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0602540Z triton_flex_attention_backward_142 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0603691Z triton_flex_attention_backward_144 0.0131 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0604838Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0606036Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0607243Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0608415Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0609563Z triton_flex_attention_backward_149 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0609805Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.2392 seconds precompiling for 13 choices 2025-12-04T09:15:00.0609936Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.0610005Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.0610076Z unimplemented [] 2025-12-04T09:15:00.0610191Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.0610374Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.0611549Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.0611607Z graph_break [] 2025-12-04T09:15:00.0611737Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.0611805Z Autotune Choices Stats: 2025-12-04T09:15:00.0613266Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.0613518Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0613771Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0614153Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0615286Z triton_flex_attention_156 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0616399Z triton_flex_attention_157 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0617538Z triton_flex_attention_152 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0618654Z triton_flex_attention_154 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.0619775Z triton_flex_attention_155 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.0620883Z triton_flex_attention_153 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0621131Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3309 seconds precompiling for 6 choices 2025-12-04T09:15:00.0621196Z Autotune Choices Stats: 2025-12-04T09:15:00.0622663Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_160", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.0623171Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0623495Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0624063Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0625256Z triton_flex_attention_backward_160 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0626413Z triton_flex_attention_backward_158 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0627564Z triton_flex_attention_backward_159 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0628711Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0629870Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0631044Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0632252Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0633398Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0634590Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0635741Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.0635987Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.1546 seconds precompiling for 13 choices 2025-12-04T09:15:00.0636124Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.0636191Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.0636255Z unimplemented [] 2025-12-04T09:15:00.0636364Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.0636547Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.0637744Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.0637807Z graph_break [] 2025-12-04T09:15:00.0637933Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.0638006Z Autotune Choices Stats: 2025-12-04T09:15:00.0639428Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.0639736Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0639951Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0640269Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0641405Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0642551Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0643657Z triton_flex_attention_171 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0644777Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.0645896Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.0647013Z triton_flex_attention_172 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0647261Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3199 seconds precompiling for 6 choices 2025-12-04T09:15:00.0647325Z Autotune Choices Stats: 2025-12-04T09:15:00.0648783Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_178", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.0649293Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0649621Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0650222Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0651389Z triton_flex_attention_backward_178 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0652591Z triton_flex_attention_backward_177 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0653754Z triton_flex_attention_backward_180 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0654902Z triton_flex_attention_backward_179 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0656056Z triton_flex_attention_backward_182 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0657241Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0658452Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0659642Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0660794Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0662110Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.0662364Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3449 seconds precompiling for 13 choices 2025-12-04T09:15:00.0662501Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.0662568Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.0662629Z unimplemented [] 2025-12-04T09:15:00.0662748Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.0662936Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.0664118Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.0664177Z graph_break [] 2025-12-04T09:15:00.0664303Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.0664374Z Autotune Choices Stats: 2025-12-04T09:15:00.0665821Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:15:00.0666175Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0666391Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0666714Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0667887Z triton_flex_attention_194 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0669009Z triton_flex_attention_195 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0670129Z triton_flex_attention_192 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.0671248Z triton_flex_attention_193 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.0672353Z triton_flex_attention_190 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0673498Z triton_flex_attention_191 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0673847Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3094 seconds precompiling for 6 choices 2025-12-04T09:15:00.0673911Z Autotune Choices Stats: 2025-12-04T09:15:00.0675333Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.0675782Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0676143Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0676706Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0677862Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0679023Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0680178Z triton_flex_attention_backward_197 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0681329Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0682512Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0683724Z triton_flex_attention_backward_200 0.0124 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0684912Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0686065Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0687209Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.0688366Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0688609Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.3930 seconds precompiling for 13 choices 2025-12-04T09:15:00.0688745Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.0688815Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.0688877Z unimplemented [] 2025-12-04T09:15:00.0688982Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.0689162Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.0690375Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.0690493Z graph_break [] 2025-12-04T09:15:00.0690619Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.0690693Z Autotune Choices Stats: 2025-12-04T09:15:00.0692074Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008063999935984612, "best_triton_pos": 0} 2025-12-04T09:15:00.0692368Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0692587Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0692943Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0694069Z triton_flex_attention_213 0.0081 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0695178Z triton_flex_attention_214 0.0082 ms 98.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0696298Z triton_flex_attention_211 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.0697419Z triton_flex_attention_212 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.0698532Z triton_flex_attention_209 0.0112 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0699692Z triton_flex_attention_210 0.0133 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0700014Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.3148 seconds precompiling for 6 choices 2025-12-04T09:15:00.0700078Z Autotune Choices Stats: 2025-12-04T09:15:00.0701524Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.0701966Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0702289Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0702854Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0704016Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0705169Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0706328Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0707516Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0708729Z triton_flex_attention_backward_219 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0709875Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0711060Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0712222Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0713373Z triton_flex_attention_backward_224 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0714523Z triton_flex_attention_backward_227 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.0714768Z SingleProcess AUTOTUNE benchmarking takes 0.6619 seconds and 2.3746 seconds precompiling for 13 choices 2025-12-04T09:15:00.0714901Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.0714969Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.0715029Z unimplemented [] 2025-12-04T09:15:00.0715136Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.0715316Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.0716527Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.0716645Z graph_break [] 2025-12-04T09:15:00.0716768Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.0716851Z Autotune Choices Stats: 2025-12-04T09:15:00.0718265Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.0718516Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0718728Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0719043Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0720164Z triton_flex_attention_232 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0721279Z triton_flex_attention_233 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0722385Z triton_flex_attention_228 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0723498Z triton_flex_attention_230 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.0724638Z triton_flex_attention_231 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.0725810Z triton_flex_attention_229 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0726048Z SingleProcess AUTOTUNE benchmarking takes 0.2900 seconds and 1.2906 seconds precompiling for 6 choices 2025-12-04T09:15:00.0726118Z Autotune Choices Stats: 2025-12-04T09:15:00.0727562Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:15:00.0728001Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0728322Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0728887Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0730054Z triton_flex_attention_backward_234 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0731210Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0732463Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0733677Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0734871Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0736045Z triton_flex_attention_backward_240 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0737197Z triton_flex_attention_backward_241 0.0134 ms 91.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0738349Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0739503Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0740653Z triton_flex_attention_backward_246 0.0153 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.0740898Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2768 seconds precompiling for 13 choices 2025-12-04T09:15:00.0741065Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.0741137Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.0741260Z unimplemented [] 2025-12-04T09:15:00.0741365Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.0741548Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.0742732Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.0742793Z graph_break [] 2025-12-04T09:15:00.0742919Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.0742997Z Autotune Choices Stats: 2025-12-04T09:15:00.0744432Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.0744684Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0744897Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0745219Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0746345Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0747459Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0748572Z triton_flex_attention_250 0.0111 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.0749726Z triton_flex_attention_247 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0750908Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.0752024Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0752299Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2952 seconds precompiling for 6 choices 2025-12-04T09:15:00.0752372Z Autotune Choices Stats: 2025-12-04T09:15:00.0753784Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_255", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.0754224Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0754551Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0755114Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0756274Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0757428Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0758612Z triton_flex_attention_backward_254 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0759826Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0761025Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0762351Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0763533Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0764695Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0765852Z triton_flex_attention_backward_262 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0767087Z triton_flex_attention_backward_265 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.0767430Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.3139 seconds precompiling for 13 choices 2025-12-04T09:15:00.0767568Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.0767637Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.0767700Z unimplemented [] 2025-12-04T09:15:00.0767809Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.0767993Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.0769163Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.0769232Z graph_break [] 2025-12-04T09:15:00.0769408Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.0769481Z Autotune Choices Stats: 2025-12-04T09:15:00.0770870Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.0771119Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0771338Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0771661Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0772826Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0773945Z triton_flex_attention_271 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0775091Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0776272Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.0777383Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.0778530Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0778772Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T09:15:00.0778859Z Autotune Choices Stats: 2025-12-04T09:15:00.0780287Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:15:00.0780727Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0781047Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0781612Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0782774Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0783968Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0785178Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0786364Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0787529Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0788674Z triton_flex_attention_backward_278 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0789836Z triton_flex_attention_backward_279 0.0136 ms 90.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0790992Z triton_flex_attention_backward_276 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0792178Z triton_flex_attention_backward_281 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0793323Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.0793643Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.2940 seconds precompiling for 13 choices 2025-12-04T09:15:00.0793773Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.0793838Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.0793898Z unimplemented [] 2025-12-04T09:15:00.0794003Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.0794188Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.0795404Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.0795469Z graph_break [] 2025-12-04T09:15:00.0795594Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.0795662Z Autotune Choices Stats: 2025-12-04T09:15:00.0797042Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.0797284Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0797499Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0797820Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0798940Z triton_flex_attention_289 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0800061Z triton_flex_attention_290 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0801204Z triton_flex_attention_287 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.0802378Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0803516Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.0804631Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0804874Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3003 seconds precompiling for 6 choices 2025-12-04T09:15:00.0804956Z Autotune Choices Stats: 2025-12-04T09:15:00.0806383Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_294", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.012160000391304493, "best_triton_pos": 0} 2025-12-04T09:15:00.0806817Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0807141Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0807707Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0808949Z triton_flex_attention_backward_294 0.0122 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0810168Z triton_flex_attention_backward_291 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0811316Z triton_flex_attention_backward_292 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0812549Z triton_flex_attention_backward_293 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0813721Z triton_flex_attention_backward_296 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0814866Z triton_flex_attention_backward_295 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0816022Z triton_flex_attention_backward_297 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0817170Z triton_flex_attention_backward_298 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0818357Z triton_flex_attention_backward_300 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0819579Z triton_flex_attention_backward_303 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.0819827Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.2523 seconds precompiling for 13 choices 2025-12-04T09:15:00.0819957Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.0820046Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.0820108Z unimplemented [] 2025-12-04T09:15:00.0820246Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.0820430Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.0821603Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.0821669Z graph_break [] 2025-12-04T09:15:00.0821794Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.0821869Z Autotune Choices Stats: 2025-12-04T09:15:00.0823265Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_308", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.0823512Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0823734Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0824053Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0825184Z triton_flex_attention_308 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0826329Z triton_flex_attention_309 0.0082 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0827520Z triton_flex_attention_307 0.0112 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.0828669Z triton_flex_attention_304 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0829791Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.0830908Z triton_flex_attention_305 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0831153Z SingleProcess AUTOTUNE benchmarking takes 0.2924 seconds and 1.4096 seconds precompiling for 6 choices 2025-12-04T09:15:00.0831226Z Autotune Choices Stats: 2025-12-04T09:15:00.0832643Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.0833083Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0833403Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0833969Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0835163Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0836381Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0837569Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0838721Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0839876Z triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0841020Z triton_flex_attention_backward_316 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0842171Z triton_flex_attention_backward_317 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0843345Z triton_flex_attention_backward_314 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0844553Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0845737Z triton_flex_attention_backward_318 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0845984Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.3225 seconds precompiling for 13 choices 2025-12-04T09:15:00.0846109Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.0846181Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.0846241Z unimplemented [] 2025-12-04T09:15:00.0846347Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.0846528Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.0847713Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.0847779Z graph_break [] 2025-12-04T09:15:00.0847914Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.0847985Z Autotune Choices Stats: 2025-12-04T09:15:00.0849374Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.007968000136315823, "best_triton_pos": 0} 2025-12-04T09:15:00.0849621Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0849834Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0850147Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0851310Z triton_flex_attention_327 0.0080 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0852524Z triton_flex_attention_328 0.0082 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0853643Z triton_flex_attention_325 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.0854792Z triton_flex_attention_326 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.0855914Z triton_flex_attention_323 0.0113 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0857032Z triton_flex_attention_324 0.0133 ms 59.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0857271Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T09:15:00.0857341Z Autotune Choices Stats: 2025-12-04T09:15:00.0858758Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.0859194Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0859573Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0860199Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0861355Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0862741Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0863912Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0865076Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0866235Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0867386Z triton_flex_attention_backward_336 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0868583Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0869815Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0870965Z triton_flex_attention_backward_338 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0872150Z triton_flex_attention_backward_341 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.0872391Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3668 seconds precompiling for 13 choices 2025-12-04T09:15:00.0872516Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.0872589Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.0872651Z unimplemented [] 2025-12-04T09:15:00.0872754Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.0872937Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.0874112Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.0874176Z graph_break [] 2025-12-04T09:15:00.0874301Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.0874371Z Autotune Choices Stats: 2025-12-04T09:15:00.0875755Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.0876003Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0876220Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0876568Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0877730Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0878870Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0880023Z triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0881141Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.0882255Z triton_flex_attention_345 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.0883374Z triton_flex_attention_343 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0883613Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2820 seconds precompiling for 6 choices 2025-12-04T09:15:00.0883684Z Autotune Choices Stats: 2025-12-04T09:15:00.0885107Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_349", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.0885575Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0885960Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0886526Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0887692Z triton_flex_attention_backward_349 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0888900Z triton_flex_attention_backward_348 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0890049Z triton_flex_attention_backward_350 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0891198Z triton_flex_attention_backward_351 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0892402Z triton_flex_attention_backward_353 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0893568Z triton_flex_attention_backward_355 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0894759Z triton_flex_attention_backward_352 0.0143 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0895964Z triton_flex_attention_backward_354 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0897144Z triton_flex_attention_backward_357 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0898303Z triton_flex_attention_backward_360 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.0898553Z SingleProcess AUTOTUNE benchmarking takes 0.6637 seconds and 2.2959 seconds precompiling for 13 choices 2025-12-04T09:15:00.0898681Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.0898755Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.0898817Z unimplemented [] 2025-12-04T09:15:00.0898918Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.0899117Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.0900291Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.0900361Z graph_break [] 2025-12-04T09:15:00.0900487Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.0900556Z Autotune Choices Stats: 2025-12-04T09:15:00.0901936Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.0902215Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0902492Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0902807Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0903938Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0905083Z triton_flex_attention_366 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0906199Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.0907310Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0908422Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.0909543Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0909797Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3270 seconds precompiling for 6 choices 2025-12-04T09:15:00.0909870Z Autotune Choices Stats: 2025-12-04T09:15:00.0911323Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_368", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.0911819Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0912141Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0912707Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0913901Z triton_flex_attention_backward_368 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0915053Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0916200Z triton_flex_attention_backward_369 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0917347Z triton_flex_attention_backward_370 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0918500Z triton_flex_attention_backward_372 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0919692Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0920919Z triton_flex_attention_backward_374 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0922096Z triton_flex_attention_backward_371 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0923247Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0924392Z triton_flex_attention_backward_379 0.0146 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.0924640Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.4637 seconds precompiling for 13 choices 2025-12-04T09:15:00.0924766Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.0924838Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.0924900Z unimplemented [] 2025-12-04T09:15:00.0924998Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.0925186Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.0926357Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.0926423Z graph_break [] 2025-12-04T09:15:00.0926549Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.0926623Z Autotune Choices Stats: 2025-12-04T09:15:00.0928052Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_384", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.0928364Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0928577Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0928894Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0930058Z triton_flex_attention_384 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0931167Z triton_flex_attention_385 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0932281Z triton_flex_attention_380 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0933445Z triton_flex_attention_382 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.0934554Z triton_flex_attention_383 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.0935666Z triton_flex_attention_381 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0935939Z SingleProcess AUTOTUNE benchmarking takes 0.2918 seconds and 1.3242 seconds precompiling for 6 choices 2025-12-04T09:15:00.0936010Z Autotune Choices Stats: 2025-12-04T09:15:00.0937487Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_386", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:15:00.0937924Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0938248Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0938893Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0940057Z triton_flex_attention_backward_386 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0941210Z triton_flex_attention_backward_387 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0942361Z triton_flex_attention_backward_388 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0943514Z triton_flex_attention_backward_389 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0944690Z triton_flex_attention_backward_391 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0945902Z triton_flex_attention_backward_392 0.0134 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0947056Z triton_flex_attention_backward_390 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0948243Z triton_flex_attention_backward_393 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0949418Z triton_flex_attention_backward_395 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0950570Z triton_flex_attention_backward_394 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0950817Z SingleProcess AUTOTUNE benchmarking takes 0.6645 seconds and 2.2524 seconds precompiling for 13 choices 2025-12-04T09:15:00.0950950Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.0951026Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.0951089Z unimplemented [] 2025-12-04T09:15:00.0951189Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.0951378Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.0952548Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.0952614Z graph_break [] 2025-12-04T09:15:00.0952770Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.0952889Z Autotune Choices Stats: 2025-12-04T09:15:00.0954320Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_403", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.0954565Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0954781Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0955097Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0956279Z triton_flex_attention_403 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0957390Z triton_flex_attention_404 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0958508Z triton_flex_attention_399 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0959632Z triton_flex_attention_401 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.0960741Z triton_flex_attention_402 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.0962049Z triton_flex_attention_400 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0962396Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3374 seconds precompiling for 6 choices 2025-12-04T09:15:00.0962480Z Autotune Choices Stats: 2025-12-04T09:15:00.0963915Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_406", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.0964398Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0964723Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0965288Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0966458Z triton_flex_attention_backward_406 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0967635Z triton_flex_attention_backward_405 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0968782Z triton_flex_attention_backward_407 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0969938Z triton_flex_attention_backward_408 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0971113Z triton_flex_attention_backward_410 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0972365Z triton_flex_attention_backward_409 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0973563Z triton_flex_attention_backward_411 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0974714Z triton_flex_attention_backward_412 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0975863Z triton_flex_attention_backward_413 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0977005Z triton_flex_attention_backward_414 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0977257Z SingleProcess AUTOTUNE benchmarking takes 0.6644 seconds and 2.4031 seconds precompiling for 13 choices 2025-12-04T09:15:00.0977386Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.0977458Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.0977520Z unimplemented [] 2025-12-04T09:15:00.0977619Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.0977803Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.0979004Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.0979130Z graph_break [] 2025-12-04T09:15:00.0979254Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.0979331Z Autotune Choices Stats: 2025-12-04T09:15:00.0980730Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_422", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.0980971Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0981222Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0981537Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0982660Z triton_flex_attention_422 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0983769Z triton_flex_attention_423 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0984887Z triton_flex_attention_420 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.0986004Z triton_flex_attention_421 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.0987159Z triton_flex_attention_418 0.0123 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0988333Z triton_flex_attention_419 0.0142 ms 57.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0988573Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3073 seconds precompiling for 6 choices 2025-12-04T09:15:00.0988641Z Autotune Choices Stats: 2025-12-04T09:15:00.0990092Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_426", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012256000190973282, "best_triton_pos": 0} 2025-12-04T09:15:00.0990532Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.0990852Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.0991421Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.0992579Z triton_flex_attention_backward_426 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.0993736Z triton_flex_attention_backward_424 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.0994889Z triton_flex_attention_backward_425 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0996072Z triton_flex_attention_backward_427 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0997271Z triton_flex_attention_backward_429 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.0998457Z triton_flex_attention_backward_431 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.0999608Z triton_flex_attention_backward_430 0.0141 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1000754Z triton_flex_attention_backward_428 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1001913Z triton_flex_attention_backward_433 0.0143 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1003065Z triton_flex_attention_backward_436 0.0144 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.1003316Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.3392 seconds precompiling for 13 choices 2025-12-04T09:15:00.1003443Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.1003518Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.1003583Z unimplemented [] 2025-12-04T09:15:00.1003685Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.1003909Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.1005147Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.1005212Z graph_break [] 2025-12-04T09:15:00.1005338Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.1005403Z Autotune Choices Stats: 2025-12-04T09:15:00.1006828Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_441", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.1007087Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1007313Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1007631Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1008764Z triton_flex_attention_441 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1009884Z triton_flex_attention_442 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1011006Z triton_flex_attention_439 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.1012122Z triton_flex_attention_437 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1013499Z triton_flex_attention_440 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.1014688Z triton_flex_attention_438 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1014932Z SingleProcess AUTOTUNE benchmarking takes 0.2906 seconds and 1.4377 seconds precompiling for 6 choices 2025-12-04T09:15:00.1015007Z Autotune Choices Stats: 2025-12-04T09:15:00.1016472Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_445", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.1016910Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1017248Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1017823Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1018988Z triton_flex_attention_backward_445 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1020159Z triton_flex_attention_backward_446 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1021339Z triton_flex_attention_backward_443 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1022547Z triton_flex_attention_backward_444 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1023724Z triton_flex_attention_backward_447 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1024879Z triton_flex_attention_backward_448 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1026035Z triton_flex_attention_backward_450 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1027183Z triton_flex_attention_backward_449 0.0135 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1028333Z triton_flex_attention_backward_452 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1029480Z triton_flex_attention_backward_455 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.1029758Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.3278 seconds precompiling for 13 choices 2025-12-04T09:15:00.1029945Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.1030026Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.1030090Z unimplemented [] 2025-12-04T09:15:00.1030195Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.1030384Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.1031559Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.1031622Z graph_break [] 2025-12-04T09:15:00.1031750Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.1031815Z Autotune Choices Stats: 2025-12-04T09:15:00.1033243Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_460", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:15:00.1033484Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1033712Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1034031Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1035161Z triton_flex_attention_460 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1036268Z triton_flex_attention_461 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1037385Z triton_flex_attention_459 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.1038526Z triton_flex_attention_456 0.0114 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1039711Z triton_flex_attention_458 0.0123 ms 75.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.1040863Z triton_flex_attention_457 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1041109Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.4513 seconds precompiling for 6 choices 2025-12-04T09:15:00.1041179Z Autotune Choices Stats: 2025-12-04T09:15:00.1042607Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_463", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.1043046Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1043365Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1043925Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1045090Z triton_flex_attention_backward_463 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1046282Z triton_flex_attention_backward_465 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1047501Z triton_flex_attention_backward_462 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1048646Z triton_flex_attention_backward_464 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1049823Z triton_flex_attention_backward_467 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1050988Z triton_flex_attention_backward_468 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1052146Z triton_flex_attention_backward_469 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1053340Z triton_flex_attention_backward_466 0.0133 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1054500Z triton_flex_attention_backward_471 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1055681Z triton_flex_attention_backward_470 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1055991Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.2662 seconds precompiling for 13 choices 2025-12-04T09:15:00.1056117Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.1056188Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.1056248Z unimplemented [] 2025-12-04T09:15:00.1056348Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.1056533Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.1057748Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.1057823Z graph_break [] 2025-12-04T09:15:00.1057952Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.1058018Z Autotune Choices Stats: 2025-12-04T09:15:00.1059418Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_479", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.007968000136315823, "best_triton_pos": 0} 2025-12-04T09:15:00.1059668Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1059889Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1060205Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1061334Z triton_flex_attention_479 0.0080 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1062604Z triton_flex_attention_480 0.0082 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1063804Z triton_flex_attention_477 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.1065018Z triton_flex_attention_478 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.1066134Z triton_flex_attention_475 0.0113 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1067301Z triton_flex_attention_476 0.0133 ms 59.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1067545Z SingleProcess AUTOTUNE benchmarking takes 0.2917 seconds and 1.3266 seconds precompiling for 6 choices 2025-12-04T09:15:00.1067611Z Autotune Choices Stats: 2025-12-04T09:15:00.1069041Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_481", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.1069489Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1069811Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1070377Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1071540Z triton_flex_attention_backward_481 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1072732Z triton_flex_attention_backward_482 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1073988Z triton_flex_attention_backward_483 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1075175Z triton_flex_attention_backward_484 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1076327Z triton_flex_attention_backward_485 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1077480Z triton_flex_attention_backward_486 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1078645Z triton_flex_attention_backward_487 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1079795Z triton_flex_attention_backward_488 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1080976Z triton_flex_attention_backward_490 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1082200Z triton_flex_attention_backward_493 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.1082460Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2475 seconds precompiling for 13 choices 2025-12-04T09:15:00.1082587Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.1082661Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.1082722Z unimplemented [] 2025-12-04T09:15:00.1082822Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.1083010Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.1084218Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.1084284Z graph_break [] 2025-12-04T09:15:00.1084406Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.1084471Z Autotune Choices Stats: 2025-12-04T09:15:00.1085870Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_498", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.1086116Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1086342Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1086656Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1087789Z triton_flex_attention_498 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1088936Z triton_flex_attention_499 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1090123Z triton_flex_attention_494 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1091235Z triton_flex_attention_496 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.1092458Z triton_flex_attention_497 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.1093587Z triton_flex_attention_495 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1093832Z SingleProcess AUTOTUNE benchmarking takes 0.2916 seconds and 1.3152 seconds precompiling for 6 choices 2025-12-04T09:15:00.1093897Z Autotune Choices Stats: 2025-12-04T09:15:00.1095324Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_500", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:15:00.1095752Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1096082Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1096639Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1097847Z triton_flex_attention_backward_500 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1099058Z triton_flex_attention_backward_501 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1100233Z triton_flex_attention_backward_502 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1101384Z triton_flex_attention_backward_503 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1102528Z triton_flex_attention_backward_505 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1103685Z triton_flex_attention_backward_507 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1104833Z triton_flex_attention_backward_506 0.0143 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1106017Z triton_flex_attention_backward_504 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1107200Z triton_flex_attention_backward_509 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1108374Z triton_flex_attention_backward_512 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.1108623Z SingleProcess AUTOTUNE benchmarking takes 0.6641 seconds and 2.2733 seconds precompiling for 13 choices 2025-12-04T09:15:00.1108783Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.1108856Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.1108916Z unimplemented [] 2025-12-04T09:15:00.1109017Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.1109202Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.1110369Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 27), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 8), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 5), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.1110434Z graph_break [] 2025-12-04T09:15:00.1110559Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.1110624Z Autotune Choices Stats: 2025-12-04T09:15:00.1112016Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_517", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.1112255Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1112479Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1112793Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1113922Z triton_flex_attention_517 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1115077Z triton_flex_attention_518 0.0084 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1116247Z triton_flex_attention_513 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1117392Z triton_flex_attention_515 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.1118514Z triton_flex_attention_516 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.1119630Z triton_flex_attention_514 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1119869Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3311 seconds precompiling for 6 choices 2025-12-04T09:15:00.1119936Z Autotune Choices Stats: 2025-12-04T09:15:00.1121363Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_519", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:15:00.1121792Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1122119Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1122712Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1123931Z triton_flex_attention_backward_519 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1125088Z triton_flex_attention_backward_520 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1126267Z triton_flex_attention_backward_521 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1127421Z triton_flex_attention_backward_522 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1128566Z triton_flex_attention_backward_524 0.0132 ms 93.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1129718Z triton_flex_attention_backward_523 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1130862Z triton_flex_attention_backward_525 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1132044Z triton_flex_attention_backward_526 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1133314Z triton_flex_attention_backward_528 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1134500Z triton_flex_attention_backward_531 0.0153 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.1134751Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.2909 seconds precompiling for 13 choices 2025-12-04T09:15:00.1134878Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.1134955Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.1135014Z unimplemented [] 2025-12-04T09:15:00.1135114Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.1135305Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.1136472Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.1136543Z graph_break [] 2025-12-04T09:15:00.1136669Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.1136734Z Autotune Choices Stats: 2025-12-04T09:15:00.1138125Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_536", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.1138366Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1138586Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1138902Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1140065Z triton_flex_attention_536 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1141237Z triton_flex_attention_537 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1142404Z triton_flex_attention_532 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1143517Z triton_flex_attention_534 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.1144632Z triton_flex_attention_535 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.1145748Z triton_flex_attention_533 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1145988Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.2928 seconds precompiling for 6 choices 2025-12-04T09:15:00.1146057Z Autotune Choices Stats: 2025-12-04T09:15:00.1147484Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_538", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:15:00.1147913Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1148279Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1148902Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1150066Z triton_flex_attention_backward_538 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1151248Z triton_flex_attention_backward_539 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1152407Z triton_flex_attention_backward_540 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1153559Z triton_flex_attention_backward_541 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1154702Z triton_flex_attention_backward_542 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1155851Z triton_flex_attention_backward_543 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1157029Z triton_flex_attention_backward_545 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1158254Z triton_flex_attention_backward_544 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1159435Z triton_flex_attention_backward_547 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1160579Z triton_flex_attention_backward_550 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.1160824Z SingleProcess AUTOTUNE benchmarking takes 0.6681 seconds and 2.3456 seconds precompiling for 13 choices 2025-12-04T09:15:00.1160949Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.1161018Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.1161084Z unimplemented [] 2025-12-04T09:15:00.1161181Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.1161373Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.1162716Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.1162783Z graph_break [] 2025-12-04T09:15:00.1162925Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.1162996Z Autotune Choices Stats: 2025-12-04T09:15:00.1164392Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_555", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.1164636Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1164924Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1165335Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1166476Z triton_flex_attention_555 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1167587Z triton_flex_attention_556 0.0112 ms 91.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1168751Z triton_flex_attention_553 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.1169871Z triton_flex_attention_554 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.1170987Z triton_flex_attention_551 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1172106Z triton_flex_attention_552 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1172418Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.3230 seconds precompiling for 6 choices 2025-12-04T09:15:00.1172495Z Autotune Choices Stats: 2025-12-04T09:15:00.1173955Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_559", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.1174467Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1174798Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1175361Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1176557Z triton_flex_attention_backward_559 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1177706Z triton_flex_attention_backward_560 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1178862Z triton_flex_attention_backward_558 0.0142 ms 93.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1180018Z triton_flex_attention_backward_557 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1181164Z triton_flex_attention_backward_562 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1182338Z triton_flex_attention_backward_561 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1183538Z triton_flex_attention_backward_563 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1184692Z triton_flex_attention_backward_564 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1185874Z triton_flex_attention_backward_569 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.1187016Z triton_flex_attention_backward_566 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1187271Z SingleProcess AUTOTUNE benchmarking takes 0.6659 seconds and 2.2862 seconds precompiling for 13 choices 2025-12-04T09:15:00.1187404Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.1187474Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.1187542Z unimplemented [] 2025-12-04T09:15:00.1187644Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.1187833Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.1189004Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.1189065Z graph_break [] 2025-12-04T09:15:00.1189198Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.1189264Z Autotune Choices Stats: 2025-12-04T09:15:00.1190677Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_575", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.1190983Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1191204Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1191531Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1192666Z triton_flex_attention_575 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1193807Z triton_flex_attention_574 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1194922Z triton_flex_attention_572 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.1196032Z triton_flex_attention_573 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.1197141Z triton_flex_attention_570 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1198251Z triton_flex_attention_571 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1198500Z SingleProcess AUTOTUNE benchmarking takes 0.2930 seconds and 1.2848 seconds precompiling for 6 choices 2025-12-04T09:15:00.1198565Z Autotune Choices Stats: 2025-12-04T09:15:00.1200020Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_579", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.014271999709308147, "best_triton_pos": 0} 2025-12-04T09:15:00.1200508Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1200839Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1201483Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1202686Z triton_flex_attention_backward_579 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1203832Z triton_flex_attention_backward_577 0.0143 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1204988Z triton_flex_attention_backward_578 0.0143 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1206138Z triton_flex_attention_backward_576 0.0145 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1207277Z triton_flex_attention_backward_580 0.0154 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1208453Z triton_flex_attention_backward_581 0.0154 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1209716Z triton_flex_attention_backward_583 0.0155 ms 92.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1210897Z triton_flex_attention_backward_582 0.0164 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1212051Z triton_flex_attention_backward_585 0.0164 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1213269Z triton_flex_attention_backward_588 0.0164 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.1213518Z SingleProcess AUTOTUNE benchmarking takes 0.6683 seconds and 2.2637 seconds precompiling for 13 choices 2025-12-04T09:15:00.1213644Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.1213712Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.1213779Z unimplemented [] 2025-12-04T09:15:00.1213882Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.1214066Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.1215238Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.1215297Z graph_break [] 2025-12-04T09:15:00.1215427Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.1215503Z Autotune Choices Stats: 2025-12-04T09:15:00.1216925Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.1217227Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1217444Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1217757Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1218928Z triton_flex_attention_593 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1220055Z triton_flex_attention_594 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1221169Z triton_flex_attention_589 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1222280Z triton_flex_attention_591 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.1223393Z triton_flex_attention_592 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.1224534Z triton_flex_attention_590 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1224809Z SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.3283 seconds precompiling for 6 choices 2025-12-04T09:15:00.1224909Z Autotune Choices Stats: 2025-12-04T09:15:00.1226333Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_595", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:15:00.1226761Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1227126Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1227686Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1228865Z triton_flex_attention_backward_595 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1230029Z triton_flex_attention_backward_596 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1231180Z triton_flex_attention_backward_597 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1232337Z triton_flex_attention_backward_598 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1233510Z triton_flex_attention_backward_600 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1234713Z triton_flex_attention_backward_604 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1235887Z triton_flex_attention_backward_599 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1237036Z triton_flex_attention_backward_601 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1238202Z triton_flex_attention_backward_602 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1239354Z triton_flex_attention_backward_607 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.1239601Z SingleProcess AUTOTUNE benchmarking takes 0.6683 seconds and 2.3815 seconds precompiling for 13 choices 2025-12-04T09:15:00.1239729Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.1239799Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.1239865Z unimplemented [] 2025-12-04T09:15:00.1239964Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.1240143Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.1241363Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.1241452Z graph_break [] 2025-12-04T09:15:00.1241621Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.1241687Z Autotune Choices Stats: 2025-12-04T09:15:00.1243074Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_612", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.1243315Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1243535Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1243881Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1245004Z triton_flex_attention_612 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1246118Z triton_flex_attention_613 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1247236Z triton_flex_attention_610 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.1248345Z triton_flex_attention_608 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1249471Z triton_flex_attention_611 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.1250612Z triton_flex_attention_609 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1250931Z SingleProcess AUTOTUNE benchmarking takes 0.2948 seconds and 1.3336 seconds precompiling for 6 choices 2025-12-04T09:15:00.1250997Z Autotune Choices Stats: 2025-12-04T09:15:00.1252474Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_614", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:15:00.1252938Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1253265Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1253824Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1254988Z triton_flex_attention_backward_614 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1256138Z triton_flex_attention_backward_615 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1257299Z triton_flex_attention_backward_616 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1258487Z triton_flex_attention_backward_617 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1259696Z triton_flex_attention_backward_619 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1260849Z triton_flex_attention_backward_620 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1262206Z triton_flex_attention_backward_621 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1263375Z triton_flex_attention_backward_618 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1264533Z triton_flex_attention_backward_623 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1265678Z triton_flex_attention_backward_626 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.1265928Z SingleProcess AUTOTUNE benchmarking takes 0.6671 seconds and 2.3094 seconds precompiling for 13 choices 2025-12-04T09:15:00.1266055Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.1266132Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.1266203Z unimplemented [] 2025-12-04T09:15:00.1266304Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.1266487Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.1267712Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.1267856Z graph_break [] 2025-12-04T09:15:00.1267986Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.1268049Z Autotune Choices Stats: 2025-12-04T09:15:00.1269433Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_631", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.1269726Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1269949Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1270262Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1271383Z triton_flex_attention_631 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1272497Z triton_flex_attention_632 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1273610Z triton_flex_attention_627 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1274720Z triton_flex_attention_629 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.1275872Z triton_flex_attention_630 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.1277046Z triton_flex_attention_628 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1277297Z SingleProcess AUTOTUNE benchmarking takes 0.2939 seconds and 1.3035 seconds precompiling for 6 choices 2025-12-04T09:15:00.1277364Z Autotune Choices Stats: 2025-12-04T09:15:00.1278823Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_634", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.1279264Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1279591Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1280154Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1281318Z triton_flex_attention_backward_634 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1282477Z triton_flex_attention_backward_635 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1283665Z triton_flex_attention_backward_636 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1284810Z triton_flex_attention_backward_633 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1286015Z triton_flex_attention_backward_638 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1287197Z triton_flex_attention_backward_640 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1288344Z triton_flex_attention_backward_637 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1289501Z triton_flex_attention_backward_639 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1290644Z triton_flex_attention_backward_642 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1291808Z triton_flex_attention_backward_645 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.1292053Z SingleProcess AUTOTUNE benchmarking takes 0.6697 seconds and 2.2461 seconds precompiling for 13 choices 2025-12-04T09:15:00.1292182Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.1292283Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.1292426Z unimplemented [] 2025-12-04T09:15:00.1292557Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.1292755Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.1293931Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.1293989Z graph_break [] 2025-12-04T09:15:00.1294119Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.1294184Z Autotune Choices Stats: 2025-12-04T09:15:00.1295601Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_650", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.1295845Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1296065Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1296379Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1297505Z triton_flex_attention_650 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1298617Z triton_flex_attention_651 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1299737Z triton_flex_attention_648 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.1300874Z triton_flex_attention_646 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1302064Z triton_flex_attention_649 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.1303171Z triton_flex_attention_647 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1303451Z SingleProcess AUTOTUNE benchmarking takes 0.2938 seconds and 1.3188 seconds precompiling for 6 choices 2025-12-04T09:15:00.1303518Z Autotune Choices Stats: 2025-12-04T09:15:00.1304941Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_654", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.1305369Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1305710Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1306273Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1307439Z triton_flex_attention_backward_654 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1308595Z triton_flex_attention_backward_652 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1309775Z triton_flex_attention_backward_653 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1310981Z triton_flex_attention_backward_655 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1312159Z triton_flex_attention_backward_657 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1313310Z triton_flex_attention_backward_659 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1314458Z triton_flex_attention_backward_658 0.0156 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1315610Z triton_flex_attention_backward_656 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1316758Z triton_flex_attention_backward_661 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1317954Z triton_flex_attention_backward_664 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.1318271Z SingleProcess AUTOTUNE benchmarking takes 0.6665 seconds and 2.3325 seconds precompiling for 13 choices 2025-12-04T09:15:00.1318401Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.1318468Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.1318536Z unimplemented [] 2025-12-04T09:15:00.1318635Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.1318817Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.1319993Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.1320064Z graph_break [] 2025-12-04T09:15:00.1320234Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.1320303Z Autotune Choices Stats: 2025-12-04T09:15:00.1321694Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_669", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.1321935Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1322151Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1322468Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1323594Z triton_flex_attention_669 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1324712Z triton_flex_attention_670 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1325868Z triton_flex_attention_667 0.0123 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.1327003Z triton_flex_attention_665 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1328163Z triton_flex_attention_668 0.0135 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.1329320Z triton_flex_attention_666 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1329567Z SingleProcess AUTOTUNE benchmarking takes 0.2949 seconds and 1.4095 seconds precompiling for 6 choices 2025-12-04T09:15:00.1329631Z Autotune Choices Stats: 2025-12-04T09:15:00.1331076Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_673", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012512000277638435, "best_triton_pos": 0} 2025-12-04T09:15:00.1331506Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1331830Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1332438Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1333605Z triton_flex_attention_backward_673 0.0125 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1334788Z triton_flex_attention_backward_672 0.0133 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1336025Z triton_flex_attention_backward_674 0.0133 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1337202Z triton_flex_attention_backward_671 0.0143 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1338351Z triton_flex_attention_backward_676 0.0143 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1339513Z triton_flex_attention_backward_675 0.0152 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1340662Z triton_flex_attention_backward_677 0.0154 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1341809Z triton_flex_attention_backward_678 0.0154 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1342956Z triton_flex_attention_backward_680 0.0164 ms 76.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1344299Z triton_flex_attention_backward_683 0.0164 ms 76.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.1344623Z SingleProcess AUTOTUNE benchmarking takes 0.6665 seconds and 2.3098 seconds precompiling for 13 choices 2025-12-04T09:15:00.1344751Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.1344819Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.1344882Z unimplemented [] 2025-12-04T09:15:00.1344981Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.1345165Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.1346370Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.1346432Z graph_break [] 2025-12-04T09:15:00.1346565Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.1346629Z Autotune Choices Stats: 2025-12-04T09:15:00.1348017Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_688", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.1348262Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1348478Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1348811Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1349945Z triton_flex_attention_688 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1351065Z triton_flex_attention_689 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1352213Z triton_flex_attention_686 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.1353389Z triton_flex_attention_684 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1354536Z triton_flex_attention_687 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.1355643Z triton_flex_attention_685 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1355888Z SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.3382 seconds precompiling for 6 choices 2025-12-04T09:15:00.1355955Z Autotune Choices Stats: 2025-12-04T09:15:00.1357375Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_693", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.1357806Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1358135Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1358697Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1359902Z triton_flex_attention_backward_693 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1361117Z triton_flex_attention_backward_692 0.0135 ms 98.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1362428Z triton_flex_attention_backward_691 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1363641Z triton_flex_attention_backward_695 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1364808Z triton_flex_attention_backward_690 0.0153 ms 86.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1365968Z triton_flex_attention_backward_696 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1367124Z triton_flex_attention_backward_697 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1368275Z triton_flex_attention_backward_699 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1369481Z triton_flex_attention_backward_694 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1370718Z triton_flex_attention_backward_702 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.1370965Z SingleProcess AUTOTUNE benchmarking takes 0.6670 seconds and 2.3646 seconds precompiling for 13 choices 2025-12-04T09:15:00.1371097Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.1371169Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.1371236Z unimplemented [] 2025-12-04T09:15:00.1371371Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.1371554Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.1372779Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.1372839Z graph_break [] 2025-12-04T09:15:00.1372969Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.1373037Z Autotune Choices Stats: 2025-12-04T09:15:00.1374425Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_707", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.1374670Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1374886Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1375209Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1376334Z triton_flex_attention_707 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1377482Z triton_flex_attention_708 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1378655Z triton_flex_attention_703 0.0134 ms 76.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1379797Z triton_flex_attention_705 0.0135 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.1380920Z triton_flex_attention_706 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.1382046Z triton_flex_attention_704 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1382300Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3640 seconds precompiling for 6 choices 2025-12-04T09:15:00.1382366Z Autotune Choices Stats: 2025-12-04T09:15:00.1383789Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_712", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.1384223Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1384551Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1385114Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1386320Z triton_flex_attention_backward_712 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1387525Z triton_flex_attention_backward_710 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1388717Z triton_flex_attention_backward_711 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1389877Z triton_flex_attention_backward_709 0.0153 ms 86.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1391033Z triton_flex_attention_backward_714 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1392188Z triton_flex_attention_backward_716 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1393337Z triton_flex_attention_backward_713 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1394518Z triton_flex_attention_backward_715 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1395728Z triton_flex_attention_backward_718 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1396877Z triton_flex_attention_backward_721 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.1397169Z SingleProcess AUTOTUNE benchmarking takes 0.6684 seconds and 2.3297 seconds precompiling for 13 choices 2025-12-04T09:15:00.1397306Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.1397375Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.1397440Z unimplemented [] 2025-12-04T09:15:00.1397539Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.1397720Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.1398903Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.1398965Z graph_break [] 2025-12-04T09:15:00.1399094Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.1399159Z Autotune Choices Stats: 2025-12-04T09:15:00.1400549Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_726", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.1400800Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1401016Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1401331Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1402488Z triton_flex_attention_726 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1403666Z triton_flex_attention_727 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1404783Z triton_flex_attention_724 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.1405926Z triton_flex_attention_722 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1407052Z triton_flex_attention_725 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.1408178Z triton_flex_attention_723 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1408424Z SingleProcess AUTOTUNE benchmarking takes 0.2946 seconds and 1.3261 seconds precompiling for 6 choices 2025-12-04T09:15:00.1408487Z Autotune Choices Stats: 2025-12-04T09:15:00.1409919Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_730", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.1410352Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1410711Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1411303Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1412538Z triton_flex_attention_backward_730 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1413719Z triton_flex_attention_backward_729 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1414872Z triton_flex_attention_backward_731 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1416015Z triton_flex_attention_backward_728 0.0152 ms 87.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1417175Z triton_flex_attention_backward_733 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1418321Z triton_flex_attention_backward_732 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1419512Z triton_flex_attention_backward_734 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1420692Z triton_flex_attention_backward_735 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1421879Z triton_flex_attention_backward_737 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1423065Z triton_flex_attention_backward_736 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1423306Z SingleProcess AUTOTUNE benchmarking takes 0.6686 seconds and 2.3200 seconds precompiling for 13 choices 2025-12-04T09:15:00.1423439Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.1423507Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.1423571Z unimplemented [] 2025-12-04T09:15:00.1423672Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.1423858Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.1425031Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.1425090Z graph_break [] 2025-12-04T09:15:00.1425222Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.1425289Z Autotune Choices Stats: 2025-12-04T09:15:00.1426675Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_745", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.1426926Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1427142Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1427494Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1428626Z triton_flex_attention_745 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1429824Z triton_flex_attention_746 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1430971Z triton_flex_attention_743 0.0142 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.1432084Z triton_flex_attention_741 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1433206Z triton_flex_attention_744 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.1434315Z triton_flex_attention_742 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1434561Z SingleProcess AUTOTUNE benchmarking takes 0.2945 seconds and 1.2852 seconds precompiling for 6 choices 2025-12-04T09:15:00.1434626Z Autotune Choices Stats: 2025-12-04T09:15:00.1436047Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_748", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.1436515Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1436912Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1437471Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1438646Z triton_flex_attention_backward_748 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1439831Z triton_flex_attention_backward_749 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1443131Z triton_flex_attention_backward_747 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1444366Z triton_flex_attention_backward_750 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1445555Z triton_flex_attention_backward_752 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1446710Z triton_flex_attention_backward_751 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1447869Z triton_flex_attention_backward_753 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1449111Z triton_flex_attention_backward_754 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1450299Z triton_flex_attention_backward_756 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1451445Z triton_flex_attention_backward_759 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.1451699Z SingleProcess AUTOTUNE benchmarking takes 0.6747 seconds and 2.3463 seconds precompiling for 13 choices 2025-12-04T09:15:00.1451838Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.1452160Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.1452229Z unimplemented [] 2025-12-04T09:15:00.1452436Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.1452626Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.1453811Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.1453887Z graph_break [] 2025-12-04T09:15:00.1454016Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.1454091Z Autotune Choices Stats: 2025-12-04T09:15:00.1455499Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_764", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.1455750Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1456016Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1456367Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1457492Z triton_flex_attention_764 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1458625Z triton_flex_attention_765 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1459746Z triton_flex_attention_762 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.1460898Z triton_flex_attention_760 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1462191Z triton_flex_attention_763 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.1463324Z triton_flex_attention_761 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1463577Z SingleProcess AUTOTUNE benchmarking takes 0.2940 seconds and 1.3826 seconds precompiling for 6 choices 2025-12-04T09:15:00.1463650Z Autotune Choices Stats: 2025-12-04T09:15:00.1465074Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_769", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013376000337302685, "best_triton_pos": 0} 2025-12-04T09:15:00.1465646Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1465970Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1466537Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1467739Z triton_flex_attention_backward_769 0.0134 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1468899Z triton_flex_attention_backward_767 0.0143 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1470106Z triton_flex_attention_backward_768 0.0143 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1471249Z triton_flex_attention_backward_766 0.0154 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1472403Z triton_flex_attention_backward_770 0.0154 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1473545Z triton_flex_attention_backward_771 0.0154 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1474776Z triton_flex_attention_backward_772 0.0154 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1476018Z triton_flex_attention_backward_773 0.0154 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1477170Z triton_flex_attention_backward_775 0.0164 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1478353Z triton_flex_attention_backward_778 0.0164 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.1478605Z SingleProcess AUTOTUNE benchmarking takes 0.6673 seconds and 2.4135 seconds precompiling for 13 choices 2025-12-04T09:15:00.1478736Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.1478816Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.1478878Z unimplemented [] 2025-12-04T09:15:00.1478982Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.1479179Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.1480354Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.1480422Z graph_break [] 2025-12-04T09:15:00.1480551Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.1480625Z Autotune Choices Stats: 2025-12-04T09:15:00.1482015Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_783", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.1482329Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1482546Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1482862Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1484026Z triton_flex_attention_783 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1485133Z triton_flex_attention_784 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1486283Z triton_flex_attention_781 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.1487395Z triton_flex_attention_779 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1488501Z triton_flex_attention_782 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.1489619Z triton_flex_attention_780 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1489861Z SingleProcess AUTOTUNE benchmarking takes 0.2940 seconds and 1.3371 seconds precompiling for 6 choices 2025-12-04T09:15:00.1489931Z Autotune Choices Stats: 2025-12-04T09:15:00.1491387Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_787", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.1491854Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1492178Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1492851Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1494010Z triton_flex_attention_backward_787 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1495210Z triton_flex_attention_backward_785 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1496357Z triton_flex_attention_backward_786 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1497508Z triton_flex_attention_backward_788 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1498662Z triton_flex_attention_backward_790 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1499896Z triton_flex_attention_backward_791 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1501044Z triton_flex_attention_backward_789 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1502219Z triton_flex_attention_backward_792 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1503370Z triton_flex_attention_backward_794 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1504545Z triton_flex_attention_backward_797 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.1504798Z SingleProcess AUTOTUNE benchmarking takes 0.6666 seconds and 2.3976 seconds precompiling for 13 choices 2025-12-04T09:15:00.1504926Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.1505005Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.1505074Z unimplemented [] 2025-12-04T09:15:00.1505181Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.1505376Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.1506548Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.1511363Z graph_break [] 2025-12-04T09:15:00.1511548Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.1511634Z Autotune Choices Stats: 2025-12-04T09:15:00.1513159Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_803", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.1513457Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1513700Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1514023Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1515204Z triton_flex_attention_803 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1516321Z triton_flex_attention_802 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1517482Z triton_flex_attention_801 0.0143 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.1518585Z triton_flex_attention_798 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1519693Z triton_flex_attention_800 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.1520805Z triton_flex_attention_799 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1521123Z SingleProcess AUTOTUNE benchmarking takes 0.2967 seconds and 1.3426 seconds precompiling for 6 choices 2025-12-04T09:15:00.1521203Z Autotune Choices Stats: 2025-12-04T09:15:00.1522632Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_806", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.1523075Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1523432Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1524000Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1525206Z triton_flex_attention_backward_806 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1526366Z triton_flex_attention_backward_807 0.0142 ms 93.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1527519Z triton_flex_attention_backward_804 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1528673Z triton_flex_attention_backward_805 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1529822Z triton_flex_attention_backward_809 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1531049Z triton_flex_attention_backward_810 0.0163 ms 81.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1532233Z triton_flex_attention_backward_813 0.0163 ms 81.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1533495Z triton_flex_attention_backward_808 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1534697Z triton_flex_attention_backward_811 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1535841Z triton_flex_attention_backward_816 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.1536100Z SingleProcess AUTOTUNE benchmarking takes 0.6691 seconds and 2.4016 seconds precompiling for 13 choices 2025-12-04T09:15:00.1536241Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.1536318Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.1536382Z unimplemented [] 2025-12-04T09:15:00.1536486Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.1536685Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.1537868Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.1538005Z graph_break [] 2025-12-04T09:15:00.1538139Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.1538205Z Autotune Choices Stats: 2025-12-04T09:15:00.1539598Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_821", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.1539844Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1540101Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1540420Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1541554Z triton_flex_attention_821 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1542695Z triton_flex_attention_822 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1543817Z triton_flex_attention_819 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.1544934Z triton_flex_attention_820 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.1546032Z triton_flex_attention_817 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1547221Z triton_flex_attention_818 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1547467Z SingleProcess AUTOTUNE benchmarking takes 0.2943 seconds and 1.3304 seconds precompiling for 6 choices 2025-12-04T09:15:00.1547541Z Autotune Choices Stats: 2025-12-04T09:15:00.1548997Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_826", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.1549451Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1549775Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1550350Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1551559Z triton_flex_attention_backward_826 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1552712Z triton_flex_attention_backward_823 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1553857Z triton_flex_attention_backward_824 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1555003Z triton_flex_attention_backward_825 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1556201Z triton_flex_attention_backward_828 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1557380Z triton_flex_attention_backward_829 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1558528Z triton_flex_attention_backward_830 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1559719Z triton_flex_attention_backward_827 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1560872Z triton_flex_attention_backward_835 0.0173 ms 76.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.1562207Z triton_flex_attention_backward_831 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1562469Z SingleProcess AUTOTUNE benchmarking takes 0.6677 seconds and 2.3188 seconds precompiling for 13 choices 2025-12-04T09:15:00.1562645Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:15:00.1562729Z Traceback (most recent call last): 2025-12-04T09:15:00.1563032Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:15:00.1563110Z self.assertTrue( 2025-12-04T09:15:00.1563428Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:15:00.1563565Z raise self.failureException(msg) 2025-12-04T09:15:00.1563821Z AssertionError: False is not true : Log file /tmp/tmpdv3ka747/flex_attention_configs.json was not created 2025-12-04T09:15:00.1563827Z 2025-12-04T09:15:00.1563964Z To execute this test, run the following from the base repo dir: 2025-12-04T09:15:00.1564221Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:15:00.1564225Z 2025-12-04T09:15:00.1564400Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:15:00.1564535Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.1564607Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.1564680Z unimplemented [] 2025-12-04T09:15:00.1564787Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.1566028Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('async_compile_cache_miss', 23), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:15:00.1566228Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.1566296Z graph_break [] 2025-12-04T09:15:00.1566426Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.1567423Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:15:00.1567516Z current_size = base.storage().size() 2025-12-04T09:15:00.1567639Z Autotune Choices Stats: 2025-12-04T09:15:00.1569053Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.1569307Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1569537Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1569865Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1570991Z triton_flex_attention_4 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1572105Z triton_flex_attention_5 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1573337Z triton_flex_attention_0 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1574489Z triton_flex_attention_2 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.1575605Z triton_flex_attention_3 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.1576754Z triton_flex_attention_1 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1577013Z SingleProcess AUTOTUNE benchmarking takes 0.2600 seconds and 1.4244 seconds precompiling for 6 choices 2025-12-04T09:15:00.1577080Z Autotune Choices Stats: 2025-12-04T09:15:00.1578523Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.1578960Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1579301Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1579864Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1581147Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1582296Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1583502Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1584655Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1585847Z triton_flex_attention_backward_11 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1586992Z triton_flex_attention_backward_12 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1588145Z triton_flex_attention_backward_13 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1589302Z triton_flex_attention_backward_10 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1590513Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1591690Z triton_flex_attention_backward_15 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1591946Z SingleProcess AUTOTUNE benchmarking takes 0.8717 seconds and 2.2315 seconds precompiling for 13 choices 2025-12-04T09:15:00.1592075Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.1592145Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.1592213Z unimplemented [] 2025-12-04T09:15:00.1592320Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.1592506Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.1593758Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.1593823Z graph_break [] 2025-12-04T09:15:00.1593956Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.1594023Z Autotune Choices Stats: 2025-12-04T09:15:00.1595551Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.1595795Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1596016Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1596348Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1597476Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1598649Z triton_flex_attention_24 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1599790Z triton_flex_attention_19 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1600895Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.1602039Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.1603147Z triton_flex_attention_20 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1603394Z SingleProcess AUTOTUNE benchmarking takes 0.2898 seconds and 1.2778 seconds precompiling for 6 choices 2025-12-04T09:15:00.1603463Z Autotune Choices Stats: 2025-12-04T09:15:00.1604895Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.1605328Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1605654Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1606284Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1607447Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1608624Z triton_flex_attention_backward_27 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1609784Z triton_flex_attention_backward_28 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1610962Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1612108Z triton_flex_attention_backward_30 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1613305Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1614438Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1615658Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1616798Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1617980Z triton_flex_attention_backward_37 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.1618231Z SingleProcess AUTOTUNE benchmarking takes 0.6605 seconds and 2.3199 seconds precompiling for 13 choices 2025-12-04T09:15:00.1618363Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.1618432Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.1618508Z unimplemented [] 2025-12-04T09:15:00.1618612Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.1618829Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.1620016Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.1620077Z graph_break [] 2025-12-04T09:15:00.1620210Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.1620278Z Autotune Choices Stats: 2025-12-04T09:15:00.1621674Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.1621922Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1622147Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1622461Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1623649Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1624767Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1625931Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.1627045Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.1628185Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1629296Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1629552Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2616 seconds precompiling for 6 choices 2025-12-04T09:15:00.1629620Z Autotune Choices Stats: 2025-12-04T09:15:00.1631042Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.1631506Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1631875Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1632435Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1633621Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1634772Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1635959Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1637103Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1638250Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1639409Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1640554Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1641763Z triton_flex_attention_backward_50 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1642935Z triton_flex_attention_backward_53 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1644090Z triton_flex_attention_backward_56 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.1644345Z SingleProcess AUTOTUNE benchmarking takes 0.6631 seconds and 2.2635 seconds precompiling for 13 choices 2025-12-04T09:15:00.1644507Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.1644577Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.1644647Z unimplemented [] 2025-12-04T09:15:00.1644749Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.1644930Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.1646113Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.1646178Z graph_break [] 2025-12-04T09:15:00.1646320Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.1646390Z Autotune Choices Stats: 2025-12-04T09:15:00.1647768Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.1648047Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1648296Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1648616Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1649736Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1650878Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1651992Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.1653192Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.1654315Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1655432Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1655681Z SingleProcess AUTOTUNE benchmarking takes 0.2907 seconds and 1.2500 seconds precompiling for 6 choices 2025-12-04T09:15:00.1655748Z Autotune Choices Stats: 2025-12-04T09:15:00.1657175Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.1657687Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1658016Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1658583Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1659778Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1660925Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1662307Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1663465Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1664623Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1665768Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1667005Z triton_flex_attention_backward_69 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1668198Z triton_flex_attention_backward_70 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1669342Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1670528Z triton_flex_attention_backward_75 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.1670777Z SingleProcess AUTOTUNE benchmarking takes 0.6601 seconds and 2.1469 seconds precompiling for 13 choices 2025-12-04T09:15:00.1670911Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.1670978Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.1671047Z unimplemented [] 2025-12-04T09:15:00.1671150Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.1671332Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.1672506Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.1672568Z graph_break [] 2025-12-04T09:15:00.1672699Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.1672765Z Autotune Choices Stats: 2025-12-04T09:15:00.1674157Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.1674467Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1674685Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1675008Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1676161Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1677299Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1678465Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.1679585Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.1680707Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1681830Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1682084Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2576 seconds precompiling for 6 choices 2025-12-04T09:15:00.1682216Z Autotune Choices Stats: 2025-12-04T09:15:00.1683647Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_84", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.1684086Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1684455Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1685027Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1686214Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1687405Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1688558Z triton_flex_attention_backward_83 0.0122 ms 92.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1689716Z triton_flex_attention_backward_82 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1690864Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1692121Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1693320Z triton_flex_attention_backward_88 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1694509Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1695680Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1696831Z triton_flex_attention_backward_91 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1697079Z SingleProcess AUTOTUNE benchmarking takes 0.6599 seconds and 2.2799 seconds precompiling for 13 choices 2025-12-04T09:15:00.1697219Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.1697291Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.1697361Z unimplemented [] 2025-12-04T09:15:00.1697469Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.1697657Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.1698845Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.1698906Z graph_break [] 2025-12-04T09:15:00.1699040Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.1699175Z Autotune Choices Stats: 2025-12-04T09:15:00.1700583Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:15:00.1700837Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1701058Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1701390Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1702548Z triton_flex_attention_99 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1703673Z triton_flex_attention_100 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1704816Z triton_flex_attention_95 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1705931Z triton_flex_attention_98 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.1707051Z triton_flex_attention_97 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.1708161Z triton_flex_attention_96 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1708474Z SingleProcess AUTOTUNE benchmarking takes 0.2905 seconds and 1.2994 seconds precompiling for 6 choices 2025-12-04T09:15:00.1708541Z Autotune Choices Stats: 2025-12-04T09:15:00.1709983Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_102", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.1710448Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1710781Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1711341Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1712545Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1713717Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1714875Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1716027Z triton_flex_attention_backward_101 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1717265Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1718449Z triton_flex_attention_backward_107 0.0133 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1719646Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1720809Z triton_flex_attention_backward_108 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1722006Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1723159Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1723411Z SingleProcess AUTOTUNE benchmarking takes 0.6615 seconds and 2.2434 seconds precompiling for 13 choices 2025-12-04T09:15:00.1723547Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.1723614Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.1723681Z unimplemented [] 2025-12-04T09:15:00.1723783Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.1723964Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.1725145Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.1725269Z graph_break [] 2025-12-04T09:15:00.1725404Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.1725469Z Autotune Choices Stats: 2025-12-04T09:15:00.1726859Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.1727109Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1727358Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1727684Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1728809Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1729969Z triton_flex_attention_119 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1731099Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.1732224Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.1733425Z triton_flex_attention_114 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1734609Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1734859Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2587 seconds precompiling for 6 choices 2025-12-04T09:15:00.1734927Z Autotune Choices Stats: 2025-12-04T09:15:00.1736390Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:15:00.1736824Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1737153Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1737718Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1738921Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1740071Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1741353Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1742509Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1743728Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1744924Z triton_flex_attention_backward_124 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1746075Z triton_flex_attention_backward_126 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1747266Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1748415Z triton_flex_attention_backward_129 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1749569Z triton_flex_attention_backward_132 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.1749814Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.3330 seconds precompiling for 13 choices 2025-12-04T09:15:00.1749950Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.1750018Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.1750083Z unimplemented [] 2025-12-04T09:15:00.1750196Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.1750425Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.1751651Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.1751713Z graph_break [] 2025-12-04T09:15:00.1751847Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.1751914Z Autotune Choices Stats: 2025-12-04T09:15:00.1753360Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:15:00.1753617Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1753835Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1754165Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1755331Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1756461Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1757576Z triton_flex_attention_136 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.1758701Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.1759865Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1761007Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1761261Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3488 seconds precompiling for 6 choices 2025-12-04T09:15:00.1761337Z Autotune Choices Stats: 2025-12-04T09:15:00.1763029Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_140", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.1763474Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1763864Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1764425Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1765592Z triton_flex_attention_backward_140 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1766751Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1767902Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1769151Z triton_flex_attention_backward_142 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1770342Z triton_flex_attention_backward_144 0.0131 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1771492Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1772756Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1773920Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1775059Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1776212Z triton_flex_attention_backward_149 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1776462Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.2392 seconds precompiling for 13 choices 2025-12-04T09:15:00.1776661Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.1776731Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.1776794Z unimplemented [] 2025-12-04T09:15:00.1776900Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.1777083Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.1778260Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.1778321Z graph_break [] 2025-12-04T09:15:00.1778454Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.1778521Z Autotune Choices Stats: 2025-12-04T09:15:00.1779948Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.1780198Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1780415Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1780773Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1781903Z triton_flex_attention_156 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1783024Z triton_flex_attention_157 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1784140Z triton_flex_attention_152 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1785258Z triton_flex_attention_154 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.1786455Z triton_flex_attention_155 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.1787601Z triton_flex_attention_153 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1787850Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3309 seconds precompiling for 6 choices 2025-12-04T09:15:00.1787917Z Autotune Choices Stats: 2025-12-04T09:15:00.1789349Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_160", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.1789811Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1790135Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1790705Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1791869Z triton_flex_attention_backward_160 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1793021Z triton_flex_attention_backward_158 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1794240Z triton_flex_attention_backward_159 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1795390Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1796574Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1797720Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1798910Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1800067Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1801214Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1802383Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.1802693Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.1546 seconds precompiling for 13 choices 2025-12-04T09:15:00.1802829Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.1802897Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.1802959Z unimplemented [] 2025-12-04T09:15:00.1803067Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.1803249Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.1804454Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.1804516Z graph_break [] 2025-12-04T09:15:00.1804644Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.1804717Z Autotune Choices Stats: 2025-12-04T09:15:00.1806100Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.1806383Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1806600Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1806932Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1808056Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1809176Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1810297Z triton_flex_attention_171 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1811478Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.1812675Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.1813785Z triton_flex_attention_172 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1814033Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3199 seconds precompiling for 6 choices 2025-12-04T09:15:00.1814102Z Autotune Choices Stats: 2025-12-04T09:15:00.1815560Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_178", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.1815998Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1816327Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1816904Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1818069Z triton_flex_attention_backward_178 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1819223Z triton_flex_attention_backward_177 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1820465Z triton_flex_attention_backward_180 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1821643Z triton_flex_attention_backward_179 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1822812Z triton_flex_attention_backward_182 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1823993Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1825143Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1826298Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1827443Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1828657Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.1828902Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3449 seconds precompiling for 13 choices 2025-12-04T09:15:00.1829044Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.1829115Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.1829179Z unimplemented [] 2025-12-04T09:15:00.1829305Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.1829527Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.1830702Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.1830763Z graph_break [] 2025-12-04T09:15:00.1830889Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.1830961Z Autotune Choices Stats: 2025-12-04T09:15:00.1832377Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:15:00.1832640Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1832855Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1833179Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1834311Z triton_flex_attention_194 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1835428Z triton_flex_attention_195 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1836601Z triton_flex_attention_192 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.1837717Z triton_flex_attention_193 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.1838857Z triton_flex_attention_190 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1839972Z triton_flex_attention_191 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1840255Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3094 seconds precompiling for 6 choices 2025-12-04T09:15:00.1840322Z Autotune Choices Stats: 2025-12-04T09:15:00.1841741Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.1842175Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1842501Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1843067Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1844229Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1845444Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1846629Z triton_flex_attention_backward_197 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1847779Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1849039Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1850191Z triton_flex_attention_backward_200 0.0124 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1851343Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1852535Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1853769Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.1854922Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1855169Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.3930 seconds precompiling for 13 choices 2025-12-04T09:15:00.1855338Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.1855409Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.1855470Z unimplemented [] 2025-12-04T09:15:00.1855578Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.1855760Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.1856937Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.1857000Z graph_break [] 2025-12-04T09:15:00.1857163Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.1857234Z Autotune Choices Stats: 2025-12-04T09:15:00.1858618Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008063999935984612, "best_triton_pos": 0} 2025-12-04T09:15:00.1858868Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1859086Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1859415Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1860542Z triton_flex_attention_213 0.0081 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1861653Z triton_flex_attention_214 0.0082 ms 98.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1863021Z triton_flex_attention_211 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.1864212Z triton_flex_attention_212 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.1865340Z triton_flex_attention_209 0.0112 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1866516Z triton_flex_attention_210 0.0133 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1866769Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.3148 seconds precompiling for 6 choices 2025-12-04T09:15:00.1866835Z Autotune Choices Stats: 2025-12-04T09:15:00.1868257Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.1868697Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1869024Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1869602Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1870880Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1872033Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1873232Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1874414Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1875569Z triton_flex_attention_backward_219 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1876714Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1877872Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1879153Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1880366Z triton_flex_attention_backward_224 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1881575Z triton_flex_attention_backward_227 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.1881825Z SingleProcess AUTOTUNE benchmarking takes 0.6619 seconds and 2.3746 seconds precompiling for 13 choices 2025-12-04T09:15:00.1881960Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.1882031Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.1882092Z unimplemented [] 2025-12-04T09:15:00.1882202Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.1882384Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.1883601Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.1883666Z graph_break [] 2025-12-04T09:15:00.1883792Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.1883876Z Autotune Choices Stats: 2025-12-04T09:15:00.1885261Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.1885516Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1885733Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1886060Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1887187Z triton_flex_attention_232 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1888374Z triton_flex_attention_233 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1889522Z triton_flex_attention_228 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1890646Z triton_flex_attention_230 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.1891795Z triton_flex_attention_231 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.1892970Z triton_flex_attention_229 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1893228Z SingleProcess AUTOTUNE benchmarking takes 0.2900 seconds and 1.2906 seconds precompiling for 6 choices 2025-12-04T09:15:00.1893307Z Autotune Choices Stats: 2025-12-04T09:15:00.1894737Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:15:00.1895174Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1895499Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1896145Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1897312Z triton_flex_attention_backward_234 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1898504Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1899665Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1901082Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1902237Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1903389Z triton_flex_attention_backward_240 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1904543Z triton_flex_attention_backward_241 0.0134 ms 91.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1905765Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1906939Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1908090Z triton_flex_attention_backward_246 0.0153 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.1908335Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2768 seconds precompiling for 13 choices 2025-12-04T09:15:00.1908468Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.1908538Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.1908614Z unimplemented [] 2025-12-04T09:15:00.1908759Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.1908945Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.1910127Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.1910187Z graph_break [] 2025-12-04T09:15:00.1910316Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.1910389Z Autotune Choices Stats: 2025-12-04T09:15:00.1911776Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.1912022Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1912235Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1912649Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1913776Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1914933Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1916065Z triton_flex_attention_250 0.0111 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.1917217Z triton_flex_attention_247 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1918330Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.1919448Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1919690Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2952 seconds precompiling for 6 choices 2025-12-04T09:15:00.1919762Z Autotune Choices Stats: 2025-12-04T09:15:00.1921181Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_255", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.1921686Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1922014Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1922584Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1923777Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1924938Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1926127Z triton_flex_attention_backward_254 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1927280Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1928440Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1929595Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1930835Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1931986Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1933242Z triton_flex_attention_backward_262 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1934396Z triton_flex_attention_backward_265 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.1934675Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.3139 seconds precompiling for 13 choices 2025-12-04T09:15:00.1934811Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.1934881Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.1934943Z unimplemented [] 2025-12-04T09:15:00.1935051Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.1935234Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.1936412Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.1936479Z graph_break [] 2025-12-04T09:15:00.1936606Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.1936678Z Autotune Choices Stats: 2025-12-04T09:15:00.1938069Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.1938387Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1938603Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1938927Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1940053Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1941202Z triton_flex_attention_271 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1942315Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1943465Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.1944579Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.1945698Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1945943Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T09:15:00.1946015Z Autotune Choices Stats: 2025-12-04T09:15:00.1947444Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:15:00.1947978Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1948301Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1948906Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1950082Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1951274Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1952427Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1953580Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1954737Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1955885Z triton_flex_attention_backward_278 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1957099Z triton_flex_attention_backward_279 0.0136 ms 90.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1958278Z triton_flex_attention_backward_276 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1959431Z triton_flex_attention_backward_281 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1960625Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.1960881Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.2940 seconds precompiling for 13 choices 2025-12-04T09:15:00.1961014Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.1961082Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.1961144Z unimplemented [] 2025-12-04T09:15:00.1961249Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.1961436Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.1962771Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.1962842Z graph_break [] 2025-12-04T09:15:00.1962973Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.1963044Z Autotune Choices Stats: 2025-12-04T09:15:00.1964436Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.1964817Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1965035Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1965355Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1966530Z triton_flex_attention_289 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1967652Z triton_flex_attention_290 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1968813Z triton_flex_attention_287 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.1969936Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1971054Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.1972169Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1972527Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3003 seconds precompiling for 6 choices 2025-12-04T09:15:00.1972613Z Autotune Choices Stats: 2025-12-04T09:15:00.1974037Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_294", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.012160000391304493, "best_triton_pos": 0} 2025-12-04T09:15:00.1974475Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1974845Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1975413Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1976580Z triton_flex_attention_backward_294 0.0122 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1977800Z triton_flex_attention_backward_291 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1978946Z triton_flex_attention_backward_292 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1980102Z triton_flex_attention_backward_293 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1981248Z triton_flex_attention_backward_296 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1982451Z triton_flex_attention_backward_295 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.1983684Z triton_flex_attention_backward_297 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.1984831Z triton_flex_attention_backward_298 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.1986014Z triton_flex_attention_backward_300 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1987165Z triton_flex_attention_backward_303 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.1987411Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.2523 seconds precompiling for 13 choices 2025-12-04T09:15:00.1987541Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.1987619Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.1987681Z unimplemented [] 2025-12-04T09:15:00.1987792Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.1987976Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.1989148Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.1989278Z graph_break [] 2025-12-04T09:15:00.1989406Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.1989480Z Autotune Choices Stats: 2025-12-04T09:15:00.1990876Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_308", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.1991127Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.1991348Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.1991698Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.1992825Z triton_flex_attention_308 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1993970Z triton_flex_attention_309 0.0082 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1995084Z triton_flex_attention_307 0.0112 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.1996199Z triton_flex_attention_304 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1997307Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.1998426Z triton_flex_attention_305 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.1998732Z SingleProcess AUTOTUNE benchmarking takes 0.2924 seconds and 1.4096 seconds precompiling for 6 choices 2025-12-04T09:15:00.1998804Z Autotune Choices Stats: 2025-12-04T09:15:00.2000272Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.2000713Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2001041Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2001607Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2002816Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2003987Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2005152Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2006316Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2007544Z triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2008685Z triton_flex_attention_backward_316 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2009896Z triton_flex_attention_backward_317 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2011082Z triton_flex_attention_backward_314 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2012240Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2013428Z triton_flex_attention_backward_318 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2013679Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.3225 seconds precompiling for 13 choices 2025-12-04T09:15:00.2013813Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.2013881Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.2013941Z unimplemented [] 2025-12-04T09:15:00.2014051Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.2014235Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.2015413Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.2015563Z graph_break [] 2025-12-04T09:15:00.2015694Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.2015766Z Autotune Choices Stats: 2025-12-04T09:15:00.2017187Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.007968000136315823, "best_triton_pos": 0} 2025-12-04T09:15:00.2017439Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2017653Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2017975Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2019127Z triton_flex_attention_327 0.0080 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2020241Z triton_flex_attention_328 0.0082 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2021358Z triton_flex_attention_325 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.2022480Z triton_flex_attention_326 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.2023609Z triton_flex_attention_323 0.0113 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2024796Z triton_flex_attention_324 0.0133 ms 59.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2025040Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T09:15:00.2025113Z Autotune Choices Stats: 2025-12-04T09:15:00.2026570Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.2027017Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2027339Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2027949Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2029122Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2030290Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2031445Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2032634Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2033815Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2034993Z triton_flex_attention_backward_336 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2036144Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2037321Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2038467Z triton_flex_attention_backward_338 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2039628Z triton_flex_attention_backward_341 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.2039872Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3668 seconds precompiling for 13 choices 2025-12-04T09:15:00.2039998Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.2040106Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.2040219Z unimplemented [] 2025-12-04T09:15:00.2040331Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.2040521Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.2041694Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.2041761Z graph_break [] 2025-12-04T09:15:00.2041886Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.2041956Z Autotune Choices Stats: 2025-12-04T09:15:00.2043376Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.2043624Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2043844Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2044173Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2045338Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2046449Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2047565Z triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2048683Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.2049858Z triton_flex_attention_345 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.2050970Z triton_flex_attention_343 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2051248Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2820 seconds precompiling for 6 choices 2025-12-04T09:15:00.2051323Z Autotune Choices Stats: 2025-12-04T09:15:00.2052800Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_349", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.2053241Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2053604Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2054176Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2055343Z triton_flex_attention_backward_349 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2056498Z triton_flex_attention_backward_348 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2057653Z triton_flex_attention_backward_350 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2058865Z triton_flex_attention_backward_351 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2060055Z triton_flex_attention_backward_353 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2061206Z triton_flex_attention_backward_355 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2062615Z triton_flex_attention_backward_352 0.0143 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2063792Z triton_flex_attention_backward_354 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2064973Z triton_flex_attention_backward_357 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2066125Z triton_flex_attention_backward_360 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.2066465Z SingleProcess AUTOTUNE benchmarking takes 0.6637 seconds and 2.2959 seconds precompiling for 13 choices 2025-12-04T09:15:00.2066596Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.2066671Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.2066733Z unimplemented [] 2025-12-04T09:15:00.2066841Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.2067024Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.2068196Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.2068264Z graph_break [] 2025-12-04T09:15:00.2068445Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.2068520Z Autotune Choices Stats: 2025-12-04T09:15:00.2069915Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.2070166Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2070417Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2070736Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2071865Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2072990Z triton_flex_attention_366 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2074105Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.2075314Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2076420Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.2077576Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2077820Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3270 seconds precompiling for 6 choices 2025-12-04T09:15:00.2077890Z Autotune Choices Stats: 2025-12-04T09:15:00.2079341Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_368", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.2079782Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2080110Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2080680Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2081844Z triton_flex_attention_backward_368 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2083006Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2084218Z triton_flex_attention_backward_369 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2085402Z triton_flex_attention_backward_370 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2086556Z triton_flex_attention_backward_372 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2087736Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2088892Z triton_flex_attention_backward_374 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2090037Z triton_flex_attention_backward_371 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2091190Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2092383Z triton_flex_attention_backward_379 0.0146 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.2092699Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.4637 seconds precompiling for 13 choices 2025-12-04T09:15:00.2092827Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.2092900Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.2092961Z unimplemented [] 2025-12-04T09:15:00.2093067Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.2093250Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.2094460Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.2094542Z graph_break [] 2025-12-04T09:15:00.2094670Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.2094744Z Autotune Choices Stats: 2025-12-04T09:15:00.2096162Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_384", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.2096411Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2096627Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2096943Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2098073Z triton_flex_attention_384 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2099187Z triton_flex_attention_385 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2100300Z triton_flex_attention_380 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2101474Z triton_flex_attention_382 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.2102615Z triton_flex_attention_383 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.2103739Z triton_flex_attention_381 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2103983Z SingleProcess AUTOTUNE benchmarking takes 0.2918 seconds and 1.3242 seconds precompiling for 6 choices 2025-12-04T09:15:00.2104056Z Autotune Choices Stats: 2025-12-04T09:15:00.2105531Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_386", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:15:00.2105967Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2106292Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2106864Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2108018Z triton_flex_attention_backward_386 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2109241Z triton_flex_attention_backward_387 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2110390Z triton_flex_attention_backward_388 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2111581Z triton_flex_attention_backward_389 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2112736Z triton_flex_attention_backward_391 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2113916Z triton_flex_attention_backward_392 0.0134 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2115066Z triton_flex_attention_backward_390 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2116213Z triton_flex_attention_backward_393 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2117366Z triton_flex_attention_backward_395 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2118625Z triton_flex_attention_backward_394 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2118870Z SingleProcess AUTOTUNE benchmarking takes 0.6645 seconds and 2.2524 seconds precompiling for 13 choices 2025-12-04T09:15:00.2119004Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.2119080Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.2119142Z unimplemented [] 2025-12-04T09:15:00.2119277Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.2119467Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.2120635Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.2120702Z graph_break [] 2025-12-04T09:15:00.2120829Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.2120902Z Autotune Choices Stats: 2025-12-04T09:15:00.2122332Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_403", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.2122584Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2122801Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2123123Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2124247Z triton_flex_attention_403 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2125362Z triton_flex_attention_404 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2126536Z triton_flex_attention_399 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2127680Z triton_flex_attention_401 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.2128793Z triton_flex_attention_402 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.2129947Z triton_flex_attention_400 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2130190Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3374 seconds precompiling for 6 choices 2025-12-04T09:15:00.2130266Z Autotune Choices Stats: 2025-12-04T09:15:00.2131691Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_406", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.2132132Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2132498Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2133062Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2134230Z triton_flex_attention_backward_406 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2135443Z triton_flex_attention_backward_405 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2136645Z triton_flex_attention_backward_407 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2137795Z triton_flex_attention_backward_408 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2138989Z triton_flex_attention_backward_410 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2140139Z triton_flex_attention_backward_409 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2141299Z triton_flex_attention_backward_411 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2142451Z triton_flex_attention_backward_412 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2143663Z triton_flex_attention_backward_413 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2144836Z triton_flex_attention_backward_414 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2145090Z SingleProcess AUTOTUNE benchmarking takes 0.6644 seconds and 2.4031 seconds precompiling for 13 choices 2025-12-04T09:15:00.2145215Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.2145289Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.2145352Z unimplemented [] 2025-12-04T09:15:00.2145454Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.2145644Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.2146854Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.2146927Z graph_break [] 2025-12-04T09:15:00.2147052Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.2147126Z Autotune Choices Stats: 2025-12-04T09:15:00.2148507Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_422", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.2148759Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2148976Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2149294Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2150427Z triton_flex_attention_422 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2151600Z triton_flex_attention_423 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2152722Z triton_flex_attention_420 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.2153874Z triton_flex_attention_421 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.2154980Z triton_flex_attention_418 0.0123 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2156126Z triton_flex_attention_419 0.0142 ms 57.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2156369Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3073 seconds precompiling for 6 choices 2025-12-04T09:15:00.2156439Z Autotune Choices Stats: 2025-12-04T09:15:00.2157859Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_426", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012256000190973282, "best_triton_pos": 0} 2025-12-04T09:15:00.2158298Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2158622Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2159250Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2160406Z triton_flex_attention_backward_426 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2161603Z triton_flex_attention_backward_424 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2162939Z triton_flex_attention_backward_425 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2164165Z triton_flex_attention_backward_427 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2165319Z triton_flex_attention_backward_429 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2166472Z triton_flex_attention_backward_431 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2167637Z triton_flex_attention_backward_430 0.0141 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2168909Z triton_flex_attention_backward_428 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2170066Z triton_flex_attention_backward_433 0.0143 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2171262Z triton_flex_attention_backward_436 0.0144 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.2171516Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.3392 seconds precompiling for 13 choices 2025-12-04T09:15:00.2171644Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.2171718Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.2171779Z unimplemented [] 2025-12-04T09:15:00.2171883Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.2172076Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.2173338Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.2173413Z graph_break [] 2025-12-04T09:15:00.2173540Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.2173606Z Autotune Choices Stats: 2025-12-04T09:15:00.2175008Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_441", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.2175256Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2175472Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2175791Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2176953Z triton_flex_attention_441 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2178088Z triton_flex_attention_442 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2179239Z triton_flex_attention_439 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.2180355Z triton_flex_attention_437 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2181503Z triton_flex_attention_440 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.2182619Z triton_flex_attention_438 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2182865Z SingleProcess AUTOTUNE benchmarking takes 0.2906 seconds and 1.4377 seconds precompiling for 6 choices 2025-12-04T09:15:00.2182938Z Autotune Choices Stats: 2025-12-04T09:15:00.2184362Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_445", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.2184798Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2185196Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2185769Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2186929Z triton_flex_attention_backward_445 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2188121Z triton_flex_attention_backward_446 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2189312Z triton_flex_attention_backward_443 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2190468Z triton_flex_attention_backward_444 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2191616Z triton_flex_attention_backward_447 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2192770Z triton_flex_attention_backward_448 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2193926Z triton_flex_attention_backward_450 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2195133Z triton_flex_attention_backward_449 0.0135 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2196328Z triton_flex_attention_backward_452 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2197474Z triton_flex_attention_backward_455 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.2197724Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.3278 seconds precompiling for 13 choices 2025-12-04T09:15:00.2197882Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.2197959Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.2198024Z unimplemented [] 2025-12-04T09:15:00.2198125Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.2198310Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.2199486Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.2199558Z graph_break [] 2025-12-04T09:15:00.2199683Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.2199752Z Autotune Choices Stats: 2025-12-04T09:15:00.2201142Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_460", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:15:00.2201381Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2201688Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2202006Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2203128Z triton_flex_attention_460 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2204273Z triton_flex_attention_461 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2205406Z triton_flex_attention_459 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.2206554Z triton_flex_attention_456 0.0114 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2207668Z triton_flex_attention_458 0.0123 ms 75.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.2208791Z triton_flex_attention_457 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2209036Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.4513 seconds precompiling for 6 choices 2025-12-04T09:15:00.2209106Z Autotune Choices Stats: 2025-12-04T09:15:00.2210534Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_463", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.2211037Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2211356Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2211928Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2213200Z triton_flex_attention_backward_463 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2214358Z triton_flex_attention_backward_465 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2215543Z triton_flex_attention_backward_462 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2216694Z triton_flex_attention_backward_464 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2217847Z triton_flex_attention_backward_467 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2219001Z triton_flex_attention_backward_468 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2220213Z triton_flex_attention_backward_469 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2221381Z triton_flex_attention_backward_466 0.0133 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2222541Z triton_flex_attention_backward_471 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2223716Z triton_flex_attention_backward_470 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2223968Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.2662 seconds precompiling for 13 choices 2025-12-04T09:15:00.2224094Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.2224178Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.2224241Z unimplemented [] 2025-12-04T09:15:00.2224343Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.2224535Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.2225706Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.2225774Z graph_break [] 2025-12-04T09:15:00.2225902Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.2225967Z Autotune Choices Stats: 2025-12-04T09:15:00.2227362Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_479", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.007968000136315823, "best_triton_pos": 0} 2025-12-04T09:15:00.2227669Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2227892Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2228206Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2229384Z triton_flex_attention_479 0.0080 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2230495Z triton_flex_attention_480 0.0082 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2231663Z triton_flex_attention_477 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.2232794Z triton_flex_attention_478 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.2233915Z triton_flex_attention_475 0.0113 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2235040Z triton_flex_attention_476 0.0133 ms 59.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2235280Z SingleProcess AUTOTUNE benchmarking takes 0.2917 seconds and 1.3266 seconds precompiling for 6 choices 2025-12-04T09:15:00.2235389Z Autotune Choices Stats: 2025-12-04T09:15:00.2236848Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_481", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.2237287Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2237608Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2238208Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2239371Z triton_flex_attention_backward_481 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2240563Z triton_flex_attention_backward_482 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2241719Z triton_flex_attention_backward_483 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2242880Z triton_flex_attention_backward_484 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2244030Z triton_flex_attention_backward_485 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2245243Z triton_flex_attention_backward_486 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2246402Z triton_flex_attention_backward_487 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2247581Z triton_flex_attention_backward_488 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2248769Z triton_flex_attention_backward_490 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2249920Z triton_flex_attention_backward_493 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.2250170Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2475 seconds precompiling for 13 choices 2025-12-04T09:15:00.2250297Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.2250374Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.2250437Z unimplemented [] 2025-12-04T09:15:00.2250538Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.2250727Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.2251902Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.2251967Z graph_break [] 2025-12-04T09:15:00.2252093Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.2252240Z Autotune Choices Stats: 2025-12-04T09:15:00.2253705Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_498", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.2253947Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2254169Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2254487Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2255658Z triton_flex_attention_498 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2256763Z triton_flex_attention_499 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2257924Z triton_flex_attention_494 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2259042Z triton_flex_attention_496 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.2260164Z triton_flex_attention_497 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.2261292Z triton_flex_attention_495 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2261600Z SingleProcess AUTOTUNE benchmarking takes 0.2916 seconds and 1.3152 seconds precompiling for 6 choices 2025-12-04T09:15:00.2261675Z Autotune Choices Stats: 2025-12-04T09:15:00.2263260Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_500", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:15:00.2263778Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2264113Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2264677Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2265897Z triton_flex_attention_backward_500 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2267060Z triton_flex_attention_backward_501 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2268213Z triton_flex_attention_backward_502 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2269369Z triton_flex_attention_backward_503 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2270517Z triton_flex_attention_backward_505 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2271775Z triton_flex_attention_backward_507 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2272972Z triton_flex_attention_backward_506 0.0143 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2274121Z triton_flex_attention_backward_504 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2275323Z triton_flex_attention_backward_509 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2276469Z triton_flex_attention_backward_512 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.2276741Z SingleProcess AUTOTUNE benchmarking takes 0.6641 seconds and 2.2733 seconds precompiling for 13 choices 2025-12-04T09:15:00.2276871Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.2276945Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.2277007Z unimplemented [] 2025-12-04T09:15:00.2277109Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.2277294Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.2278464Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 27), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 8), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 5), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.2278600Z graph_break [] 2025-12-04T09:15:00.2278735Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.2278803Z Autotune Choices Stats: 2025-12-04T09:15:00.2280198Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_517", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.2280442Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2280696Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2281010Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2282142Z triton_flex_attention_517 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2283285Z triton_flex_attention_518 0.0084 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2284400Z triton_flex_attention_513 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2285520Z triton_flex_attention_515 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.2286647Z triton_flex_attention_516 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.2287830Z triton_flex_attention_514 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2288072Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3311 seconds precompiling for 6 choices 2025-12-04T09:15:00.2288138Z Autotune Choices Stats: 2025-12-04T09:15:00.2289598Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_519", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:15:00.2290043Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2290364Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2290927Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2292126Z triton_flex_attention_backward_519 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2293322Z triton_flex_attention_backward_520 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2294472Z triton_flex_attention_backward_521 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2295630Z triton_flex_attention_backward_522 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2296861Z triton_flex_attention_backward_524 0.0132 ms 93.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2298050Z triton_flex_attention_backward_523 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2299207Z triton_flex_attention_backward_525 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2300398Z triton_flex_attention_backward_526 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2301555Z triton_flex_attention_backward_528 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2302706Z triton_flex_attention_backward_531 0.0153 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.2302958Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.2909 seconds precompiling for 13 choices 2025-12-04T09:15:00.2303087Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.2303161Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.2303223Z unimplemented [] 2025-12-04T09:15:00.2303326Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.2303513Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.2304745Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.2304814Z graph_break [] 2025-12-04T09:15:00.2304942Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.2305008Z Autotune Choices Stats: 2025-12-04T09:15:00.2306434Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_536", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.2306685Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2306907Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2307223Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2308383Z triton_flex_attention_536 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2309507Z triton_flex_attention_537 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2310624Z triton_flex_attention_532 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2311734Z triton_flex_attention_534 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.2312857Z triton_flex_attention_535 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.2314041Z triton_flex_attention_533 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2314284Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.2928 seconds precompiling for 6 choices 2025-12-04T09:15:00.2314367Z Autotune Choices Stats: 2025-12-04T09:15:00.2315834Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_538", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:15:00.2316267Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2316634Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2317196Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2318369Z triton_flex_attention_backward_538 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2319536Z triton_flex_attention_backward_539 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2320691Z triton_flex_attention_backward_540 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2321914Z triton_flex_attention_backward_541 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2323096Z triton_flex_attention_backward_542 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2324256Z triton_flex_attention_backward_543 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2325447Z triton_flex_attention_backward_545 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2326594Z triton_flex_attention_backward_544 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2327755Z triton_flex_attention_backward_547 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2328900Z triton_flex_attention_backward_550 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.2329150Z SingleProcess AUTOTUNE benchmarking takes 0.6681 seconds and 2.3456 seconds precompiling for 13 choices 2025-12-04T09:15:00.2329353Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.2329426Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.2329491Z unimplemented [] 2025-12-04T09:15:00.2329593Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.2329780Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.2330949Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.2331017Z graph_break [] 2025-12-04T09:15:00.2331141Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.2331206Z Autotune Choices Stats: 2025-12-04T09:15:00.2332735Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_555", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.2332984Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2333211Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2333569Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2334693Z triton_flex_attention_555 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2335804Z triton_flex_attention_556 0.0112 ms 91.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2336926Z triton_flex_attention_553 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.2338040Z triton_flex_attention_554 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.2339217Z triton_flex_attention_551 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2340363Z triton_flex_attention_552 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2340609Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.3230 seconds precompiling for 6 choices 2025-12-04T09:15:00.2340677Z Autotune Choices Stats: 2025-12-04T09:15:00.2342129Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_559", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.2342783Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2343115Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2343680Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2344862Z triton_flex_attention_backward_559 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2346026Z triton_flex_attention_backward_560 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2347240Z triton_flex_attention_backward_558 0.0142 ms 93.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2348393Z triton_flex_attention_backward_557 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2349575Z triton_flex_attention_backward_562 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2350724Z triton_flex_attention_backward_561 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2351901Z triton_flex_attention_backward_563 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2353047Z triton_flex_attention_backward_564 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2354198Z triton_flex_attention_backward_569 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.2355339Z triton_flex_attention_backward_566 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2355651Z SingleProcess AUTOTUNE benchmarking takes 0.6659 seconds and 2.2862 seconds precompiling for 13 choices 2025-12-04T09:15:00.2355779Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.2355854Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.2355918Z unimplemented [] 2025-12-04T09:15:00.2356021Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.2356209Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.2357431Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.2357501Z graph_break [] 2025-12-04T09:15:00.2357626Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.2357695Z Autotune Choices Stats: 2025-12-04T09:15:00.2359086Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_575", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.2359380Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2359607Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2359932Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2361057Z triton_flex_attention_575 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2362348Z triton_flex_attention_574 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2363483Z triton_flex_attention_572 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.2364722Z triton_flex_attention_573 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.2365887Z triton_flex_attention_570 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2367010Z triton_flex_attention_571 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2367258Z SingleProcess AUTOTUNE benchmarking takes 0.2930 seconds and 1.2848 seconds precompiling for 6 choices 2025-12-04T09:15:00.2367326Z Autotune Choices Stats: 2025-12-04T09:15:00.2368805Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_579", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.014271999709308147, "best_triton_pos": 0} 2025-12-04T09:15:00.2369239Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2369567Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2370134Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2371307Z triton_flex_attention_backward_579 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2372510Z triton_flex_attention_backward_577 0.0143 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2373739Z triton_flex_attention_backward_578 0.0143 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2374925Z triton_flex_attention_backward_576 0.0145 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2376076Z triton_flex_attention_backward_580 0.0154 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2377264Z triton_flex_attention_backward_581 0.0154 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2378410Z triton_flex_attention_backward_583 0.0155 ms 92.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2379577Z triton_flex_attention_backward_582 0.0164 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2380732Z triton_flex_attention_backward_585 0.0164 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2381944Z triton_flex_attention_backward_588 0.0164 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.2382193Z SingleProcess AUTOTUNE benchmarking takes 0.6683 seconds and 2.2637 seconds precompiling for 13 choices 2025-12-04T09:15:00.2382322Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.2382391Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.2382468Z unimplemented [] 2025-12-04T09:15:00.2382575Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.2382765Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.2383974Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.2384048Z graph_break [] 2025-12-04T09:15:00.2384174Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.2384241Z Autotune Choices Stats: 2025-12-04T09:15:00.2385704Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.2385953Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2386180Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2386500Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2387640Z triton_flex_attention_593 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2388749Z triton_flex_attention_594 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2389950Z triton_flex_attention_589 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2391061Z triton_flex_attention_591 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.2392236Z triton_flex_attention_592 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.2393351Z triton_flex_attention_590 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2393636Z SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.3283 seconds precompiling for 6 choices 2025-12-04T09:15:00.2393705Z Autotune Choices Stats: 2025-12-04T09:15:00.2395131Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_595", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:15:00.2395560Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2395893Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2396455Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2397626Z triton_flex_attention_backward_595 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2398861Z triton_flex_attention_backward_596 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2400062Z triton_flex_attention_backward_597 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2401221Z triton_flex_attention_backward_598 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2402407Z triton_flex_attention_backward_600 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2403562Z triton_flex_attention_backward_604 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2404713Z triton_flex_attention_backward_599 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2405874Z triton_flex_attention_backward_601 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2407064Z triton_flex_attention_backward_602 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2408249Z triton_flex_attention_backward_607 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.2408501Z SingleProcess AUTOTUNE benchmarking takes 0.6683 seconds and 2.3815 seconds precompiling for 13 choices 2025-12-04T09:15:00.2408662Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.2408733Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.2408803Z unimplemented [] 2025-12-04T09:15:00.2408906Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.2409094Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.2410269Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.2410331Z graph_break [] 2025-12-04T09:15:00.2410504Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.2410571Z Autotune Choices Stats: 2025-12-04T09:15:00.2411965Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_612", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.2412206Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2412472Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2412790Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2413916Z triton_flex_attention_612 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2415022Z triton_flex_attention_613 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2416219Z triton_flex_attention_610 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.2417359Z triton_flex_attention_608 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2418481Z triton_flex_attention_611 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.2419638Z triton_flex_attention_609 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2419892Z SingleProcess AUTOTUNE benchmarking takes 0.2948 seconds and 1.3336 seconds precompiling for 6 choices 2025-12-04T09:15:00.2419959Z Autotune Choices Stats: 2025-12-04T09:15:00.2421381Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_614", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:15:00.2421820Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2422154Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2422719Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2423972Z triton_flex_attention_backward_614 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2425123Z triton_flex_attention_backward_615 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2426320Z triton_flex_attention_backward_616 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2427482Z triton_flex_attention_backward_617 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2428669Z triton_flex_attention_backward_619 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2429830Z triton_flex_attention_backward_620 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2430989Z triton_flex_attention_backward_621 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2432146Z triton_flex_attention_backward_618 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2433361Z triton_flex_attention_backward_623 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2434543Z triton_flex_attention_backward_626 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.2434796Z SingleProcess AUTOTUNE benchmarking takes 0.6671 seconds and 2.3094 seconds precompiling for 13 choices 2025-12-04T09:15:00.2434922Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.2434991Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.2435061Z unimplemented [] 2025-12-04T09:15:00.2435163Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.2435349Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.2436557Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.2436622Z graph_break [] 2025-12-04T09:15:00.2436756Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.2436822Z Autotune Choices Stats: 2025-12-04T09:15:00.2438223Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_631", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.2438466Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2438702Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2439024Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2440156Z triton_flex_attention_631 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2441353Z triton_flex_attention_632 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2442501Z triton_flex_attention_627 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2443618Z triton_flex_attention_629 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.2444765Z triton_flex_attention_630 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.2445877Z triton_flex_attention_628 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2446129Z SingleProcess AUTOTUNE benchmarking takes 0.2939 seconds and 1.3035 seconds precompiling for 6 choices 2025-12-04T09:15:00.2446199Z Autotune Choices Stats: 2025-12-04T09:15:00.2447634Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_634", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.2448064Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2448398Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2449021Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2450201Z triton_flex_attention_backward_634 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2451392Z triton_flex_attention_backward_635 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2452599Z triton_flex_attention_backward_636 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2453794Z triton_flex_attention_backward_633 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2454945Z triton_flex_attention_backward_638 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2456103Z triton_flex_attention_backward_640 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2457255Z triton_flex_attention_backward_637 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2458495Z triton_flex_attention_backward_639 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2459690Z triton_flex_attention_backward_642 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2460846Z triton_flex_attention_backward_645 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.2461094Z SingleProcess AUTOTUNE benchmarking takes 0.6697 seconds and 2.2461 seconds precompiling for 13 choices 2025-12-04T09:15:00.2461224Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.2461307Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.2461377Z unimplemented [] 2025-12-04T09:15:00.2461516Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.2461699Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.2463037Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.2463101Z graph_break [] 2025-12-04T09:15:00.2463235Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.2463304Z Autotune Choices Stats: 2025-12-04T09:15:00.2464705Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_650", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.2464947Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2465173Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2465595Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2466719Z triton_flex_attention_650 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2467833Z triton_flex_attention_651 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2468999Z triton_flex_attention_648 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.2470111Z triton_flex_attention_646 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2471281Z triton_flex_attention_649 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.2472401Z triton_flex_attention_647 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2472649Z SingleProcess AUTOTUNE benchmarking takes 0.2938 seconds and 1.3188 seconds precompiling for 6 choices 2025-12-04T09:15:00.2472714Z Autotune Choices Stats: 2025-12-04T09:15:00.2474154Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_654", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.2474652Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2474981Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2475547Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2476756Z triton_flex_attention_backward_654 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2477913Z triton_flex_attention_backward_652 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2479100Z triton_flex_attention_backward_653 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2480256Z triton_flex_attention_backward_655 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2481403Z triton_flex_attention_backward_657 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2482560Z triton_flex_attention_backward_659 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2483767Z triton_flex_attention_backward_658 0.0156 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2484915Z triton_flex_attention_backward_656 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2486098Z triton_flex_attention_backward_661 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2487246Z triton_flex_attention_backward_664 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.2487548Z SingleProcess AUTOTUNE benchmarking takes 0.6665 seconds and 2.3325 seconds precompiling for 13 choices 2025-12-04T09:15:00.2487677Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.2487745Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.2487811Z unimplemented [] 2025-12-04T09:15:00.2487912Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.2488092Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.2489272Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.2489343Z graph_break [] 2025-12-04T09:15:00.2489477Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.2489543Z Autotune Choices Stats: 2025-12-04T09:15:00.2490934Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_669", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.2491247Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2491474Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2491794Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2492953Z triton_flex_attention_669 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2494108Z triton_flex_attention_670 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2495225Z triton_flex_attention_667 0.0123 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.2496371Z triton_flex_attention_665 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2497490Z triton_flex_attention_668 0.0135 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.2498606Z triton_flex_attention_666 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2498855Z SingleProcess AUTOTUNE benchmarking takes 0.2949 seconds and 1.4095 seconds precompiling for 6 choices 2025-12-04T09:15:00.2498921Z Autotune Choices Stats: 2025-12-04T09:15:00.2500359Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_673", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012512000277638435, "best_triton_pos": 0} 2025-12-04T09:15:00.2500849Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2501178Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2506269Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2507519Z triton_flex_attention_backward_673 0.0125 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2508741Z triton_flex_attention_backward_672 0.0133 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2509905Z triton_flex_attention_backward_674 0.0133 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2511080Z triton_flex_attention_backward_671 0.0143 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2512262Z triton_flex_attention_backward_676 0.0143 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2513426Z triton_flex_attention_backward_675 0.0152 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2514637Z triton_flex_attention_backward_677 0.0154 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2515828Z triton_flex_attention_backward_678 0.0154 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2516975Z triton_flex_attention_backward_680 0.0164 ms 76.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2518158Z triton_flex_attention_backward_683 0.0164 ms 76.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.2518425Z SingleProcess AUTOTUNE benchmarking takes 0.6665 seconds and 2.3098 seconds precompiling for 13 choices 2025-12-04T09:15:00.2518571Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.2518647Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.2518718Z unimplemented [] 2025-12-04T09:15:00.2518827Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.2519018Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.2520215Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.2520284Z graph_break [] 2025-12-04T09:15:00.2520424Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.2520494Z Autotune Choices Stats: 2025-12-04T09:15:00.2521907Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_688", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.2522228Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2522449Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2522780Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2524025Z triton_flex_attention_688 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2525152Z triton_flex_attention_689 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2526306Z triton_flex_attention_686 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.2527410Z triton_flex_attention_684 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2528533Z triton_flex_attention_687 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.2529644Z triton_flex_attention_685 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2529962Z SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.3382 seconds precompiling for 6 choices 2025-12-04T09:15:00.2530031Z Autotune Choices Stats: 2025-12-04T09:15:00.2531467Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_693", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.2531901Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2532275Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2532939Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2534125Z triton_flex_attention_backward_693 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2535320Z triton_flex_attention_backward_692 0.0135 ms 98.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2536477Z triton_flex_attention_backward_691 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2537634Z triton_flex_attention_backward_695 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2538791Z triton_flex_attention_backward_690 0.0153 ms 86.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2540005Z triton_flex_attention_backward_696 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2541187Z triton_flex_attention_backward_697 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2542340Z triton_flex_attention_backward_699 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2543516Z triton_flex_attention_backward_694 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2544668Z triton_flex_attention_backward_702 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.2544923Z SingleProcess AUTOTUNE benchmarking takes 0.6670 seconds and 2.3646 seconds precompiling for 13 choices 2025-12-04T09:15:00.2545057Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.2545129Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.2545199Z unimplemented [] 2025-12-04T09:15:00.2545305Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.2545492Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.2546685Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.2546815Z graph_break [] 2025-12-04T09:15:00.2546952Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.2547023Z Autotune Choices Stats: 2025-12-04T09:15:00.2548422Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_707", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.2548668Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2548887Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2549246Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2550372Z triton_flex_attention_707 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2551543Z triton_flex_attention_708 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2552666Z triton_flex_attention_703 0.0134 ms 76.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2553789Z triton_flex_attention_705 0.0135 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.2554909Z triton_flex_attention_706 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.2556020Z triton_flex_attention_704 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2556352Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3640 seconds precompiling for 6 choices 2025-12-04T09:15:00.2556418Z Autotune Choices Stats: 2025-12-04T09:15:00.2557885Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_712", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.2558322Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2558651Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2559214Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2560429Z triton_flex_attention_backward_712 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2561589Z triton_flex_attention_backward_710 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2562939Z triton_flex_attention_backward_711 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2564104Z triton_flex_attention_backward_709 0.0153 ms 86.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2565379Z triton_flex_attention_backward_714 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2566530Z triton_flex_attention_backward_716 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2567728Z triton_flex_attention_backward_713 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2568928Z triton_flex_attention_backward_715 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2570083Z triton_flex_attention_backward_718 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2571240Z triton_flex_attention_backward_721 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.2571489Z SingleProcess AUTOTUNE benchmarking takes 0.6684 seconds and 2.3297 seconds precompiling for 13 choices 2025-12-04T09:15:00.2571626Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.2571695Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.2571766Z unimplemented [] 2025-12-04T09:15:00.2571871Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.2572053Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.2573275Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.2573412Z graph_break [] 2025-12-04T09:15:00.2573548Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.2573617Z Autotune Choices Stats: 2025-12-04T09:15:00.2575005Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_726", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.2575293Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2575516Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2575846Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2576968Z triton_flex_attention_726 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2578126Z triton_flex_attention_727 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2579249Z triton_flex_attention_724 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.2580367Z triton_flex_attention_722 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2581483Z triton_flex_attention_725 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.2582654Z triton_flex_attention_723 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2582918Z SingleProcess AUTOTUNE benchmarking takes 0.2946 seconds and 1.3261 seconds precompiling for 6 choices 2025-12-04T09:15:00.2582986Z Autotune Choices Stats: 2025-12-04T09:15:00.2584443Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_730", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.2584883Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2585210Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2585809Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2586974Z triton_flex_attention_backward_730 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2588126Z triton_flex_attention_backward_729 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2589291Z triton_flex_attention_backward_731 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2590491Z triton_flex_attention_backward_728 0.0152 ms 87.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2591675Z triton_flex_attention_backward_733 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2592866Z triton_flex_attention_backward_732 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2594017Z triton_flex_attention_backward_734 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2595201Z triton_flex_attention_backward_735 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2596346Z triton_flex_attention_backward_737 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2597503Z triton_flex_attention_backward_736 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2597747Z SingleProcess AUTOTUNE benchmarking takes 0.6686 seconds and 2.3200 seconds precompiling for 13 choices 2025-12-04T09:15:00.2597878Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.2597946Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.2598046Z unimplemented [] 2025-12-04T09:15:00.2598178Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.2598363Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.2599538Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.2599599Z graph_break [] 2025-12-04T09:15:00.2599730Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.2599795Z Autotune Choices Stats: 2025-12-04T09:15:00.2601220Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_745", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.2601470Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2601697Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2602020Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2603179Z triton_flex_attention_745 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2604294Z triton_flex_attention_746 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2605413Z triton_flex_attention_743 0.0142 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.2606522Z triton_flex_attention_741 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2607702Z triton_flex_attention_744 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.2608805Z triton_flex_attention_742 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2609087Z SingleProcess AUTOTUNE benchmarking takes 0.2945 seconds and 1.2852 seconds precompiling for 6 choices 2025-12-04T09:15:00.2609155Z Autotune Choices Stats: 2025-12-04T09:15:00.2610590Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_748", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.2611019Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2611381Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2611940Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2613161Z triton_flex_attention_backward_748 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2614320Z triton_flex_attention_backward_749 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2615477Z triton_flex_attention_backward_747 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2616691Z triton_flex_attention_backward_750 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2617876Z triton_flex_attention_backward_752 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2619032Z triton_flex_attention_backward_751 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2620232Z triton_flex_attention_backward_753 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2621385Z triton_flex_attention_backward_754 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2622534Z triton_flex_attention_backward_756 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2623689Z triton_flex_attention_backward_759 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.2623999Z SingleProcess AUTOTUNE benchmarking takes 0.6747 seconds and 2.3463 seconds precompiling for 13 choices 2025-12-04T09:15:00.2624136Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.2624205Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.2624274Z unimplemented [] 2025-12-04T09:15:00.2624376Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.2624560Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.2625756Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.2625820Z graph_break [] 2025-12-04T09:15:00.2625985Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.2626054Z Autotune Choices Stats: 2025-12-04T09:15:00.2627437Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_764", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.2627688Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2627941Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2628264Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2629397Z triton_flex_attention_764 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2630517Z triton_flex_attention_765 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2631631Z triton_flex_attention_762 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.2632782Z triton_flex_attention_760 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2633931Z triton_flex_attention_763 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.2635077Z triton_flex_attention_761 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2635326Z SingleProcess AUTOTUNE benchmarking takes 0.2940 seconds and 1.3826 seconds precompiling for 6 choices 2025-12-04T09:15:00.2635390Z Autotune Choices Stats: 2025-12-04T09:15:00.2636851Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_769", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013376000337302685, "best_triton_pos": 0} 2025-12-04T09:15:00.2637286Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2637616Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2638177Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2639342Z triton_flex_attention_backward_769 0.0134 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2640488Z triton_flex_attention_backward_767 0.0143 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2641712Z triton_flex_attention_backward_768 0.0143 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2642890Z triton_flex_attention_backward_766 0.0154 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2644046Z triton_flex_attention_backward_770 0.0154 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2645239Z triton_flex_attention_backward_771 0.0154 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2646394Z triton_flex_attention_backward_772 0.0154 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2647560Z triton_flex_attention_backward_773 0.0154 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2648708Z triton_flex_attention_backward_775 0.0164 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2649873Z triton_flex_attention_backward_778 0.0164 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.2650179Z SingleProcess AUTOTUNE benchmarking takes 0.6673 seconds and 2.4135 seconds precompiling for 13 choices 2025-12-04T09:15:00.2650311Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.2650379Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.2650439Z unimplemented [] 2025-12-04T09:15:00.2650544Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.2650727Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.2651957Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.2652019Z graph_break [] 2025-12-04T09:15:00.2652153Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.2652221Z Autotune Choices Stats: 2025-12-04T09:15:00.2653696Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_783", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.2653949Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2654167Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2654490Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2655618Z triton_flex_attention_783 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2656732Z triton_flex_attention_784 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2657848Z triton_flex_attention_781 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.2659066Z triton_flex_attention_779 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2660225Z triton_flex_attention_782 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.2661336Z triton_flex_attention_780 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2661582Z SingleProcess AUTOTUNE benchmarking takes 0.2940 seconds and 1.3371 seconds precompiling for 6 choices 2025-12-04T09:15:00.2661653Z Autotune Choices Stats: 2025-12-04T09:15:00.2663324Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_787", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.2663767Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2664103Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2664673Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2665857Z triton_flex_attention_backward_787 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2667113Z triton_flex_attention_backward_785 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2668268Z triton_flex_attention_backward_786 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2669468Z triton_flex_attention_backward_788 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2670625Z triton_flex_attention_backward_790 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2671814Z triton_flex_attention_backward_791 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2672969Z triton_flex_attention_backward_789 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2674120Z triton_flex_attention_backward_792 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2675263Z triton_flex_attention_backward_794 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2676475Z triton_flex_attention_backward_797 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.2676720Z SingleProcess AUTOTUNE benchmarking takes 0.6666 seconds and 2.3976 seconds precompiling for 13 choices 2025-12-04T09:15:00.2676866Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.2676940Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.2677005Z unimplemented [] 2025-12-04T09:15:00.2677158Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.2677346Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.2678526Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.2678588Z graph_break [] 2025-12-04T09:15:00.2678717Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.2678795Z Autotune Choices Stats: 2025-12-04T09:15:00.2680211Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_803", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.2680464Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2680682Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2681012Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2682134Z triton_flex_attention_803 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2683260Z triton_flex_attention_802 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2684451Z triton_flex_attention_801 0.0143 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.2685595Z triton_flex_attention_798 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2686712Z triton_flex_attention_800 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.2687853Z triton_flex_attention_799 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2688106Z SingleProcess AUTOTUNE benchmarking takes 0.2967 seconds and 1.3426 seconds precompiling for 6 choices 2025-12-04T09:15:00.2688172Z Autotune Choices Stats: 2025-12-04T09:15:00.2689611Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_806", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.2690051Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2690387Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2690960Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2692118Z triton_flex_attention_backward_806 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2693377Z triton_flex_attention_backward_807 0.0142 ms 93.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2694560Z triton_flex_attention_backward_804 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2695703Z triton_flex_attention_backward_805 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2696890Z triton_flex_attention_backward_809 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2698030Z triton_flex_attention_backward_810 0.0163 ms 81.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2699188Z triton_flex_attention_backward_813 0.0163 ms 81.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2700339Z triton_flex_attention_backward_808 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2701557Z triton_flex_attention_backward_811 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2702711Z triton_flex_attention_backward_816 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.2702993Z SingleProcess AUTOTUNE benchmarking takes 0.6691 seconds and 2.4016 seconds precompiling for 13 choices 2025-12-04T09:15:00.2703133Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.2703202Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.2703269Z unimplemented [] 2025-12-04T09:15:00.2703382Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.2703578Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.2704793Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.2704859Z graph_break [] 2025-12-04T09:15:00.2704985Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.2705059Z Autotune Choices Stats: 2025-12-04T09:15:00.2706444Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_821", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.2706708Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2706922Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2707243Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2708357Z triton_flex_attention_821 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2709546Z triton_flex_attention_822 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2710659Z triton_flex_attention_819 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.2711810Z triton_flex_attention_820 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.2712923Z triton_flex_attention_817 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2714079Z triton_flex_attention_818 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2714333Z SingleProcess AUTOTUNE benchmarking takes 0.2943 seconds and 1.3304 seconds precompiling for 6 choices 2025-12-04T09:15:00.2714399Z Autotune Choices Stats: 2025-12-04T09:15:00.2715824Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_826", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.2716260Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2716581Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2717230Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2718398Z triton_flex_attention_backward_826 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2719589Z triton_flex_attention_backward_823 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2720745Z triton_flex_attention_backward_824 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2721914Z triton_flex_attention_backward_825 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2723064Z triton_flex_attention_backward_828 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2724206Z triton_flex_attention_backward_829 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2725356Z triton_flex_attention_backward_830 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2726537Z triton_flex_attention_backward_827 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2727714Z triton_flex_attention_backward_835 0.0173 ms 76.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.2728900Z triton_flex_attention_backward_831 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2729143Z SingleProcess AUTOTUNE benchmarking takes 0.6677 seconds and 2.3188 seconds precompiling for 13 choices 2025-12-04T09:15:00.2729281Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.2729348Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.2729411Z unimplemented [] 2025-12-04T09:15:00.2729515Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.2729699Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.2730904Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.2730966Z graph_break [] 2025-12-04T09:15:00.2731091Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.2731165Z Autotune Choices Stats: 2025-12-04T09:15:00.2732599Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_840", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.2732850Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2733066Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2733389Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2734551Z triton_flex_attention_840 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2735698Z triton_flex_attention_841 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2736837Z triton_flex_attention_838 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.2737954Z triton_flex_attention_836 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2739096Z triton_flex_attention_839 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.2740229Z triton_flex_attention_837 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2740479Z SingleProcess AUTOTUNE benchmarking takes 0.2942 seconds and 1.3626 seconds precompiling for 6 choices 2025-12-04T09:15:00.2740547Z Autotune Choices Stats: 2025-12-04T09:15:00.2741961Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_844", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.2742395Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2742785Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2743352Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2744506Z triton_flex_attention_backward_844 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2745707Z triton_flex_attention_backward_845 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2746910Z triton_flex_attention_backward_842 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2748060Z triton_flex_attention_backward_843 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2749217Z triton_flex_attention_backward_846 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2750363Z triton_flex_attention_backward_847 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2751512Z triton_flex_attention_backward_848 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2752727Z triton_flex_attention_backward_851 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2753898Z triton_flex_attention_backward_849 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2755045Z triton_flex_attention_backward_854 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.2755289Z SingleProcess AUTOTUNE benchmarking takes 0.6694 seconds and 2.3460 seconds precompiling for 13 choices 2025-12-04T09:15:00.2755468Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:15:00.2755581Z Traceback (most recent call last): 2025-12-04T09:15:00.2755884Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:15:00.2755963Z self.assertTrue( 2025-12-04T09:15:00.2756163Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:15:00.2756244Z raise self.failureException(msg) 2025-12-04T09:15:00.2756490Z AssertionError: False is not true : Log file /tmp/tmptn9lux0k/flex_attention_configs.json was not created 2025-12-04T09:15:00.2756495Z 2025-12-04T09:15:00.2756627Z To execute this test, run the following from the base repo dir: 2025-12-04T09:15:00.2756888Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:15:00.2756895Z 2025-12-04T09:15:00.2757059Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:15:00.2757193Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.2757274Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.2757338Z unimplemented [] 2025-12-04T09:15:00.2757447Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.2758631Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('async_compile_cache_miss', 23), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:15:00.2758859Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.2758952Z graph_break [] 2025-12-04T09:15:00.2759082Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.2760077Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:15:00.2760158Z current_size = base.storage().size() 2025-12-04T09:15:00.2760232Z Autotune Choices Stats: 2025-12-04T09:15:00.2761653Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.2762072Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2762304Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2762629Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2763822Z triton_flex_attention_4 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2764933Z triton_flex_attention_5 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2766042Z triton_flex_attention_0 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2767167Z triton_flex_attention_2 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.2768278Z triton_flex_attention_3 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.2769503Z triton_flex_attention_1 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2769749Z SingleProcess AUTOTUNE benchmarking takes 0.2600 seconds and 1.4244 seconds precompiling for 6 choices 2025-12-04T09:15:00.2769832Z Autotune Choices Stats: 2025-12-04T09:15:00.2771312Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.2771755Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2772113Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2772726Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2773890Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2775049Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2776192Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2777406Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2778548Z triton_flex_attention_backward_11 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2779747Z triton_flex_attention_backward_12 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2780927Z triton_flex_attention_backward_13 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2782066Z triton_flex_attention_backward_10 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2783215Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2784365Z triton_flex_attention_backward_15 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2784618Z SingleProcess AUTOTUNE benchmarking takes 0.8717 seconds and 2.2315 seconds precompiling for 13 choices 2025-12-04T09:15:00.2784798Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.2784907Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.2784970Z unimplemented [] 2025-12-04T09:15:00.2785075Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.2785269Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.2786454Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.2786520Z graph_break [] 2025-12-04T09:15:00.2786647Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.2786717Z Autotune Choices Stats: 2025-12-04T09:15:00.2788149Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.2788402Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2788619Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2788937Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2790349Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2791466Z triton_flex_attention_24 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2792582Z triton_flex_attention_19 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2793699Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.2794912Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.2796059Z triton_flex_attention_20 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2796307Z SingleProcess AUTOTUNE benchmarking takes 0.2898 seconds and 1.2778 seconds precompiling for 6 choices 2025-12-04T09:15:00.2796379Z Autotune Choices Stats: 2025-12-04T09:15:00.2797802Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.2798278Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2798606Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2799176Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2800345Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2801505Z triton_flex_attention_backward_27 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2802649Z triton_flex_attention_backward_28 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2803860Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2805038Z triton_flex_attention_backward_30 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2806192Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2807374Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2808516Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2809672Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2810815Z triton_flex_attention_backward_37 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.2811146Z SingleProcess AUTOTUNE benchmarking takes 0.6605 seconds and 2.3199 seconds precompiling for 13 choices 2025-12-04T09:15:00.2811274Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.2811350Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.2811413Z unimplemented [] 2025-12-04T09:15:00.2811516Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.2811706Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.2812976Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.2813050Z graph_break [] 2025-12-04T09:15:00.2813177Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.2813245Z Autotune Choices Stats: 2025-12-04T09:15:00.2814631Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.2814912Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2815139Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2815456Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2816582Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2817696Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2818816Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.2819997Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.2821101Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2822248Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2822491Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2616 seconds precompiling for 6 choices 2025-12-04T09:15:00.2822561Z Autotune Choices Stats: 2025-12-04T09:15:00.2824015Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.2824453Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2824777Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2825349Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2826508Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2827668Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2828883Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2830069Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2831213Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2832400Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2833550Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2834696Z triton_flex_attention_backward_50 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2835841Z triton_flex_attention_backward_53 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2837046Z triton_flex_attention_backward_56 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.2837295Z SingleProcess AUTOTUNE benchmarking takes 0.6631 seconds and 2.2635 seconds precompiling for 13 choices 2025-12-04T09:15:00.2837420Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.2837494Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.2837555Z unimplemented [] 2025-12-04T09:15:00.2837655Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.2837843Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.2839048Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.2839116Z graph_break [] 2025-12-04T09:15:00.2839243Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.2839307Z Autotune Choices Stats: 2025-12-04T09:15:00.2840721Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.2840966Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2841186Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2841498Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2842629Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2843737Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2844900Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.2846032Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.2847172Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2848286Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2848530Z SingleProcess AUTOTUNE benchmarking takes 0.2907 seconds and 1.2500 seconds precompiling for 6 choices 2025-12-04T09:15:00.2848603Z Autotune Choices Stats: 2025-12-04T09:15:00.2850065Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.2850500Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2850829Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2851387Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2852591Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2853806Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2854950Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2856137Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2857313Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2858471Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2859636Z triton_flex_attention_backward_69 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2860782Z triton_flex_attention_backward_70 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2862088Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2863371Z triton_flex_attention_backward_75 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.2863626Z SingleProcess AUTOTUNE benchmarking takes 0.6601 seconds and 2.1469 seconds precompiling for 13 choices 2025-12-04T09:15:00.2863757Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.2863882Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.2863946Z unimplemented [] 2025-12-04T09:15:00.2864048Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.2864247Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.2865421Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.2865491Z graph_break [] 2025-12-04T09:15:00.2865620Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.2865690Z Autotune Choices Stats: 2025-12-04T09:15:00.2867138Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.2867382Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2867603Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2867927Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2869059Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2870172Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2871354Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.2872492Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.2873604Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2874769Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2875020Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2576 seconds precompiling for 6 choices 2025-12-04T09:15:00.2875087Z Autotune Choices Stats: 2025-12-04T09:15:00.2876507Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_84", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.2876953Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2877280Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2877841Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2879044Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2880237Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2881422Z triton_flex_attention_backward_83 0.0122 ms 92.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2882572Z triton_flex_attention_backward_82 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2883752Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2884895Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2886043Z triton_flex_attention_backward_88 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2887185Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2888394Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2889567Z triton_flex_attention_backward_91 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2889835Z SingleProcess AUTOTUNE benchmarking takes 0.6599 seconds and 2.2799 seconds precompiling for 13 choices 2025-12-04T09:15:00.2889965Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.2890041Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.2890106Z unimplemented [] 2025-12-04T09:15:00.2890208Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.2890394Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.2891603Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.2891672Z graph_break [] 2025-12-04T09:15:00.2891796Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.2891860Z Autotune Choices Stats: 2025-12-04T09:15:00.2893289Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:15:00.2893534Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2893763Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2894082Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2895222Z triton_flex_attention_99 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2896402Z triton_flex_attention_100 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2897521Z triton_flex_attention_95 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2898668Z triton_flex_attention_98 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.2899789Z triton_flex_attention_97 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.2900936Z triton_flex_attention_96 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2901180Z SingleProcess AUTOTUNE benchmarking takes 0.2905 seconds and 1.2994 seconds precompiling for 6 choices 2025-12-04T09:15:00.2901246Z Autotune Choices Stats: 2025-12-04T09:15:00.2902678Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_102", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.2903112Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2903438Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2904057Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2905221Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2906424Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2907577Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2908760Z triton_flex_attention_backward_101 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2909911Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2911063Z triton_flex_attention_backward_107 0.0133 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2912216Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2913425Z triton_flex_attention_backward_108 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2914581Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2915760Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2916011Z SingleProcess AUTOTUNE benchmarking takes 0.6615 seconds and 2.2434 seconds precompiling for 13 choices 2025-12-04T09:15:00.2916138Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.2916212Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.2916276Z unimplemented [] 2025-12-04T09:15:00.2916377Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.2916581Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.2917787Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.2917852Z graph_break [] 2025-12-04T09:15:00.2917979Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.2918044Z Autotune Choices Stats: 2025-12-04T09:15:00.2919433Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.2919675Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2919895Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2920209Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2921401Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2922515Z triton_flex_attention_119 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2923668Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.2924781Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.2925975Z triton_flex_attention_114 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2927092Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2927339Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2587 seconds precompiling for 6 choices 2025-12-04T09:15:00.2927405Z Autotune Choices Stats: 2025-12-04T09:15:00.2928827Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:15:00.2929254Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2929661Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2930224Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2931430Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2932648Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2933838Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2934998Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2936148Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2937305Z triton_flex_attention_backward_124 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2938455Z triton_flex_attention_backward_126 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2939694Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2940884Z triton_flex_attention_backward_129 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2942045Z triton_flex_attention_backward_132 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.2942300Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.3330 seconds precompiling for 13 choices 2025-12-04T09:15:00.2942462Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.2942541Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.2942605Z unimplemented [] 2025-12-04T09:15:00.2942707Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.2942898Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.2944074Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.2944146Z graph_break [] 2025-12-04T09:15:00.2944275Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.2944340Z Autotune Choices Stats: 2025-12-04T09:15:00.2945726Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:15:00.2945966Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2946255Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2946567Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2947701Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2948846Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2949967Z triton_flex_attention_136 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.2951114Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.2952233Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2953354Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2953596Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3488 seconds precompiling for 6 choices 2025-12-04T09:15:00.2953661Z Autotune Choices Stats: 2025-12-04T09:15:00.2955091Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_140", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.2955581Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2955908Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2956469Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2957662Z triton_flex_attention_backward_140 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2958812Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2960009Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2961165Z triton_flex_attention_backward_142 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2962496Z triton_flex_attention_backward_144 0.0131 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2963673Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2964962Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2966161Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2967316Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2968508Z triton_flex_attention_backward_149 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2968764Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.2392 seconds precompiling for 13 choices 2025-12-04T09:15:00.2968892Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.2968960Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.2969027Z unimplemented [] 2025-12-04T09:15:00.2969125Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.2969314Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.2970502Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.2970570Z graph_break [] 2025-12-04T09:15:00.2970695Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.2970761Z Autotune Choices Stats: 2025-12-04T09:15:00.2972153Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.2972560Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2972780Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2973094Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2974260Z triton_flex_attention_156 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2975375Z triton_flex_attention_157 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2976519Z triton_flex_attention_152 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2977634Z triton_flex_attention_154 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.2978764Z triton_flex_attention_155 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.2979884Z triton_flex_attention_153 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2980128Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3309 seconds precompiling for 6 choices 2025-12-04T09:15:00.2980257Z Autotune Choices Stats: 2025-12-04T09:15:00.2981679Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_160", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.2982112Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2982470Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2983036Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.2984199Z triton_flex_attention_backward_160 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2985381Z triton_flex_attention_backward_158 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2986537Z triton_flex_attention_backward_159 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2987694Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2988838Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2990067Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.2991210Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.2992398Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.2993593Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.2994742Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.2994990Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.1546 seconds precompiling for 13 choices 2025-12-04T09:15:00.2995117Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.2995189Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.2995257Z unimplemented [] 2025-12-04T09:15:00.2995361Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.2995547Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.2996719Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.2996780Z graph_break [] 2025-12-04T09:15:00.2996910Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.2997045Z Autotune Choices Stats: 2025-12-04T09:15:00.2998439Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.2998683Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.2998904Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.2999223Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.3000396Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3001506Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3002677Z triton_flex_attention_171 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3003793Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.3004914Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.3006027Z triton_flex_attention_172 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3006337Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3199 seconds precompiling for 6 choices 2025-12-04T09:15:00.3006414Z Autotune Choices Stats: 2025-12-04T09:15:00.3007838Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_178", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.3008301Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.3008633Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.3009199Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.3010397Z triton_flex_attention_backward_178 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3011553Z triton_flex_attention_backward_177 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3012752Z triton_flex_attention_backward_180 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.3013918Z triton_flex_attention_backward_179 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.3015110Z triton_flex_attention_backward_182 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3016298Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3017478Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.3018633Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.3019825Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3020971Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.3021223Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3449 seconds precompiling for 13 choices 2025-12-04T09:15:00.3021351Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.3021418Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.3021489Z unimplemented [] 2025-12-04T09:15:00.3021588Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.3021777Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.3022955Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.3023093Z graph_break [] 2025-12-04T09:15:00.3023225Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.3023290Z Autotune Choices Stats: 2025-12-04T09:15:00.3024679Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:15:00.3024956Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.3025177Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.3025489Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.3026620Z triton_flex_attention_194 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3027764Z triton_flex_attention_195 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3028881Z triton_flex_attention_192 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.3030000Z triton_flex_attention_193 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.3031112Z triton_flex_attention_190 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3032284Z triton_flex_attention_191 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3032530Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3094 seconds precompiling for 6 choices 2025-12-04T09:15:00.3032596Z Autotune Choices Stats: 2025-12-04T09:15:00.3034083Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.3034514Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.3034840Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.3035434Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.3036598Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3037751Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.3038913Z triton_flex_attention_backward_197 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3040068Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.3041273Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3042474Z triton_flex_attention_backward_200 0.0124 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3043620Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.3044804Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.3045952Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.3047106Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3047356Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.3930 seconds precompiling for 13 choices 2025-12-04T09:15:00.3047481Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.3047548Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.3047618Z unimplemented [] 2025-12-04T09:15:00.3047717Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.3047949Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.3049165Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.3049225Z graph_break [] 2025-12-04T09:15:00.3049359Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.3049425Z Autotune Choices Stats: 2025-12-04T09:15:00.3050849Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008063999935984612, "best_triton_pos": 0} 2025-12-04T09:15:00.3051094Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.3051314Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.3051628Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.3052852Z triton_flex_attention_213 0.0081 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3053982Z triton_flex_attention_214 0.0082 ms 98.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3055103Z triton_flex_attention_211 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.3056215Z triton_flex_attention_212 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.3057361Z triton_flex_attention_209 0.0112 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3058497Z triton_flex_attention_210 0.0133 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3058750Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.3148 seconds precompiling for 6 choices 2025-12-04T09:15:00.3058814Z Autotune Choices Stats: 2025-12-04T09:15:00.3060318Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.3060749Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.3061112Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.3061672Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.3062997Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3064162Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3065327Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.3066620Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.3067826Z triton_flex_attention_backward_219 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3068984Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3070180Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.3071337Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.3072490Z triton_flex_attention_backward_224 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3073640Z triton_flex_attention_backward_227 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.3073887Z SingleProcess AUTOTUNE benchmarking takes 0.6619 seconds and 2.3746 seconds precompiling for 13 choices 2025-12-04T09:15:00.3074076Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.3074148Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.3074216Z unimplemented [] 2025-12-04T09:15:00.3074317Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.3074501Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.3075689Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.3075764Z graph_break [] 2025-12-04T09:15:00.3075898Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.3075962Z Autotune Choices Stats: 2025-12-04T09:15:00.3077381Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.3077625Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.3077846Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.3078196Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.3079319Z triton_flex_attention_232 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3080443Z triton_flex_attention_233 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3081560Z triton_flex_attention_228 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3082673Z triton_flex_attention_230 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.3083851Z triton_flex_attention_231 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.3084998Z triton_flex_attention_229 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3085249Z SingleProcess AUTOTUNE benchmarking takes 0.2900 seconds and 1.2906 seconds precompiling for 6 choices 2025-12-04T09:15:00.3085314Z Autotune Choices Stats: 2025-12-04T09:15:00.3086770Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:15:00.3087204Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.3087542Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.3088109Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.3089277Z triton_flex_attention_backward_234 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3090432Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3091655Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.3092863Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.3094055Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3095207Z triton_flex_attention_backward_240 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.3096395Z triton_flex_attention_backward_241 0.0134 ms 91.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.3097547Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3098699Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3099859Z triton_flex_attention_backward_246 0.0153 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.3100198Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2768 seconds precompiling for 13 choices 2025-12-04T09:15:00.3100326Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.3100393Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.3100460Z unimplemented [] 2025-12-04T09:15:00.3100559Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.3100740Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.3101958Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.3102022Z graph_break [] 2025-12-04T09:15:00.3102154Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.3102220Z Autotune Choices Stats: 2025-12-04T09:15:00.3103608Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.3103881Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.3104103Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.3104429Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.3105549Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3106672Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3107792Z triton_flex_attention_250 0.0111 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.3108960Z triton_flex_attention_247 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3110109Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.3111222Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3111469Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2952 seconds precompiling for 6 choices 2025-12-04T09:15:00.3111533Z Autotune Choices Stats: 2025-12-04T09:15:00.3112994Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_255", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.3113427Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.3113757Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.3114327Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.3115490Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.3116641Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3117849Z triton_flex_attention_backward_254 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3119029Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.3120192Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3121378Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3122522Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.3123679Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.3124826Z triton_flex_attention_backward_262 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3126042Z triton_flex_attention_backward_265 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.3126292Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.3139 seconds precompiling for 13 choices 2025-12-04T09:15:00.3126419Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.3126487Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.3126558Z unimplemented [] 2025-12-04T09:15:00.3126659Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.3126872Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.3128054Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.3128117Z graph_break [] 2025-12-04T09:15:00.3128248Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.3128316Z Autotune Choices Stats: 2025-12-04T09:15:00.3129766Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.3130010Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.3130226Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.3130552Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.3131695Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3132876Z triton_flex_attention_271 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3134070Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3135182Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.3136340Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.3137449Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3137733Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T09:15:00.3137804Z Autotune Choices Stats: 2025-12-04T09:15:00.3139232Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:15:00.3139669Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.3140000Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.3140562Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.3141732Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3142943Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3144131Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.3145286Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.3146470Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3147623Z triton_flex_attention_backward_278 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.3148769Z triton_flex_attention_backward_279 0.0136 ms 90.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.3149931Z triton_flex_attention_backward_276 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3151150Z triton_flex_attention_backward_281 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3152301Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.3152552Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.2940 seconds precompiling for 13 choices 2025-12-04T09:15:00.3152712Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.3152780Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.3152847Z unimplemented [] 2025-12-04T09:15:00.3152948Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.3153130Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.3154325Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.3154386Z graph_break [] 2025-12-04T09:15:00.3154552Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.3154618Z Autotune Choices Stats: 2025-12-04T09:15:00.3156014Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.3156254Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.3156472Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.3156793Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.3157916Z triton_flex_attention_289 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3159067Z triton_flex_attention_290 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3160227Z triton_flex_attention_287 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.3161384Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3162671Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.3163875Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3164129Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3003 seconds precompiling for 6 choices 2025-12-04T09:15:00.3164197Z Autotune Choices Stats: 2025-12-04T09:15:00.3165641Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_294", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.012160000391304493, "best_triton_pos": 0} 2025-12-04T09:15:00.3166087Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.3166418Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.3166979Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.3168240Z triton_flex_attention_backward_294 0.0122 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.3169402Z triton_flex_attention_backward_291 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3170598Z triton_flex_attention_backward_292 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3171782Z triton_flex_attention_backward_293 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.3172983Z triton_flex_attention_backward_296 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3174149Z triton_flex_attention_backward_295 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3175301Z triton_flex_attention_backward_297 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.3176453Z triton_flex_attention_backward_298 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.3177668Z triton_flex_attention_backward_300 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3178850Z triton_flex_attention_backward_303 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.3179096Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.2523 seconds precompiling for 13 choices 2025-12-04T09:15:00.3179230Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.3179297Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.3179369Z unimplemented [] 2025-12-04T09:15:00.3179470Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.3179650Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.3180868Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.3180932Z graph_break [] 2025-12-04T09:15:00.3181061Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.3181126Z Autotune Choices Stats: 2025-12-04T09:15:00.3182516Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_308", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.3182760Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.3182973Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.3183297Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.3184422Z triton_flex_attention_308 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3185594Z triton_flex_attention_309 0.0082 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3186747Z triton_flex_attention_307 0.0112 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.3187856Z triton_flex_attention_304 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3189000Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.3190116Z triton_flex_attention_305 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3190376Z SingleProcess AUTOTUNE benchmarking takes 0.2924 seconds and 1.4096 seconds precompiling for 6 choices 2025-12-04T09:15:00.3190444Z Autotune Choices Stats: 2025-12-04T09:15:00.3191875Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.3192306Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.3192673Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.3193323Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.3194487Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3195670Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3196822Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.3198004Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.3199158Z triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3200314Z triton_flex_attention_backward_316 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.3201462Z triton_flex_attention_backward_317 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.3202672Z triton_flex_attention_backward_314 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3203854Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3205006Z triton_flex_attention_backward_318 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3205247Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.3225 seconds precompiling for 13 choices 2025-12-04T09:15:00.3205380Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.3205449Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.3205515Z unimplemented [] 2025-12-04T09:15:00.3205648Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.3205832Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.3207011Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.3207071Z graph_break [] 2025-12-04T09:15:00.3207210Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.3207281Z Autotune Choices Stats: 2025-12-04T09:15:00.3208670Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.007968000136315823, "best_triton_pos": 0} 2025-12-04T09:15:00.3208913Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.3209126Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.3209508Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.3210632Z triton_flex_attention_327 0.0080 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3211772Z triton_flex_attention_328 0.0082 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3212971Z triton_flex_attention_325 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.3214125Z triton_flex_attention_326 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.3215239Z triton_flex_attention_323 0.0113 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3216352Z triton_flex_attention_324 0.0133 ms 59.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3216600Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T09:15:00.3216666Z Autotune Choices Stats: 2025-12-04T09:15:00.3218091Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.3218591Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.3218917Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.3219483Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.3220676Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3221827Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3223024Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.3224176Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.3225333Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3226486Z triton_flex_attention_backward_336 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.3227712Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3228862Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.3230045Z triton_flex_attention_backward_338 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3231197Z triton_flex_attention_backward_341 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.3231703Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3668 seconds precompiling for 13 choices 2025-12-04T09:15:00.3231840Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.3231908Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.3231974Z unimplemented [] 2025-12-04T09:15:00.3232074Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.3232254Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.3233432Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.3233493Z graph_break [] 2025-12-04T09:15:00.3233624Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.3233688Z Autotune Choices Stats: 2025-12-04T09:15:00.3235067Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.3235380Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.3235601Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.3235917Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.3237040Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3238200Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3239316Z triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3240468Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.3241588Z triton_flex_attention_345 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.3242701Z triton_flex_attention_343 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3242949Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2820 seconds precompiling for 6 choices 2025-12-04T09:15:00.3243016Z Autotune Choices Stats: 2025-12-04T09:15:00.3244446Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_349", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.3244940Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.3245268Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.3245864Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.3247032Z triton_flex_attention_backward_349 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3248213Z triton_flex_attention_backward_348 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3249371Z triton_flex_attention_backward_350 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.3250521Z triton_flex_attention_backward_351 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.3251681Z triton_flex_attention_backward_353 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3252900Z triton_flex_attention_backward_355 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.3254119Z triton_flex_attention_backward_352 0.0143 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3255308Z triton_flex_attention_backward_354 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.3256454Z triton_flex_attention_backward_357 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3257655Z triton_flex_attention_backward_360 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.3257901Z SingleProcess AUTOTUNE benchmarking takes 0.6637 seconds and 2.2959 seconds precompiling for 13 choices 2025-12-04T09:15:00.3258033Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.3258102Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.3258165Z unimplemented [] 2025-12-04T09:15:00.3258276Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.3258459Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.3259644Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.3259705Z graph_break [] 2025-12-04T09:15:00.3259838Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.3259906Z Autotune Choices Stats: 2025-12-04T09:15:00.3261291Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.3261604Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.3261965Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.3262300Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.3263488Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3264614Z triton_flex_attention_366 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3265782Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.3266892Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3268017Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.3269130Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3269471Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3270 seconds precompiling for 6 choices 2025-12-04T09:15:00.3269539Z Autotune Choices Stats: 2025-12-04T09:15:00.3270957Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_368", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.3271384Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.3271748Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.3272306Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.3273468Z triton_flex_attention_backward_368 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3274649Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3275809Z triton_flex_attention_backward_369 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.3276962Z triton_flex_attention_backward_370 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.3278110Z triton_flex_attention_backward_372 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3279319Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.3280499Z triton_flex_attention_backward_374 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.3281661Z triton_flex_attention_backward_371 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3282844Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3283996Z triton_flex_attention_backward_379 0.0146 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.3284238Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.4637 seconds precompiling for 13 choices 2025-12-04T09:15:00.3284372Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.3284442Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.3284506Z unimplemented [] 2025-12-04T09:15:00.3284612Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.3284792Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.3285969Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.3286096Z graph_break [] 2025-12-04T09:15:00.3286228Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.3286306Z Autotune Choices Stats: 2025-12-04T09:15:00.3287683Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_384", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.3287929Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.3288146Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.3288518Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.3289647Z triton_flex_attention_384 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3290801Z triton_flex_attention_385 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3291910Z triton_flex_attention_380 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3293072Z triton_flex_attention_382 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.3294198Z triton_flex_attention_383 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.3295320Z triton_flex_attention_381 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3295635Z SingleProcess AUTOTUNE benchmarking takes 0.2918 seconds and 1.3242 seconds precompiling for 6 choices 2025-12-04T09:15:00.3295703Z Autotune Choices Stats: 2025-12-04T09:15:00.3297163Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_386", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:15:00.3297599Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.3297928Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.3298490Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.3299686Z triton_flex_attention_backward_386 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3300843Z triton_flex_attention_backward_387 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3301998Z triton_flex_attention_backward_388 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.3303148Z triton_flex_attention_backward_389 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.3304366Z triton_flex_attention_backward_391 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3305551Z triton_flex_attention_backward_392 0.0134 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.3306704Z triton_flex_attention_backward_390 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3307887Z triton_flex_attention_backward_393 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.3309036Z triton_flex_attention_backward_395 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3310184Z triton_flex_attention_backward_394 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3310429Z SingleProcess AUTOTUNE benchmarking takes 0.6645 seconds and 2.2524 seconds precompiling for 13 choices 2025-12-04T09:15:00.3310559Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.3310628Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.3310691Z unimplemented [] 2025-12-04T09:15:00.3310797Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.3310977Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.3312168Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.3312295Z graph_break [] 2025-12-04T09:15:00.3312423Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.3312497Z Autotune Choices Stats: 2025-12-04T09:15:00.3313910Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_403", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.3314161Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.3314380Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.3314702Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.3315853Z triton_flex_attention_403 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3316968Z triton_flex_attention_404 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3318085Z triton_flex_attention_399 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3319207Z triton_flex_attention_401 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.3320325Z triton_flex_attention_402 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.3321516Z triton_flex_attention_400 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3321759Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3374 seconds precompiling for 6 choices 2025-12-04T09:15:00.3321825Z Autotune Choices Stats: 2025-12-04T09:15:00.3323281Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_406", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.3323718Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.3324041Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.3324638Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.3325799Z triton_flex_attention_backward_406 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3326951Z triton_flex_attention_backward_405 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3328102Z triton_flex_attention_backward_407 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.3329361Z triton_flex_attention_backward_408 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.3330516Z triton_flex_attention_backward_410 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3331693Z triton_flex_attention_backward_409 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3332885Z triton_flex_attention_backward_411 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.3334097Z triton_flex_attention_backward_412 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.3335251Z triton_flex_attention_backward_413 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3336414Z triton_flex_attention_backward_414 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3336663Z SingleProcess AUTOTUNE benchmarking takes 0.6644 seconds and 2.4031 seconds precompiling for 13 choices 2025-12-04T09:15:00.3336796Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.3336898Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.3336991Z unimplemented [] 2025-12-04T09:15:00.3337097Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.3337279Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.3338455Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.3338513Z graph_break [] 2025-12-04T09:15:00.3338637Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.3338719Z Autotune Choices Stats: 2025-12-04T09:15:00.3340139Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_422", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.3340384Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.3340597Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.3340918Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.3342076Z triton_flex_attention_422 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3343189Z triton_flex_attention_423 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3344306Z triton_flex_attention_420 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.3345421Z triton_flex_attention_421 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.3346587Z triton_flex_attention_418 0.0123 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3347708Z triton_flex_attention_419 0.0142 ms 57.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3347988Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3073 seconds precompiling for 6 choices 2025-12-04T09:15:00.3348056Z Autotune Choices Stats: 2025-12-04T09:15:00.3349478Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_426", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012256000190973282, "best_triton_pos": 0} 2025-12-04T09:15:00.3349958Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.3350279Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.3350842Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.3352006Z triton_flex_attention_backward_426 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.3353164Z triton_flex_attention_backward_424 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3354316Z triton_flex_attention_backward_425 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3355539Z triton_flex_attention_backward_427 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.3356730Z triton_flex_attention_backward_429 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3357887Z triton_flex_attention_backward_431 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.3359080Z triton_flex_attention_backward_430 0.0141 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.3360233Z triton_flex_attention_backward_428 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3361386Z triton_flex_attention_backward_433 0.0143 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3362702Z triton_flex_attention_backward_436 0.0144 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.3363061Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.3392 seconds precompiling for 13 choices 2025-12-04T09:15:00.3363201Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.3363269Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.3363334Z unimplemented [] 2025-12-04T09:15:00.3363444Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.3363625Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.3364801Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.3364912Z graph_break [] 2025-12-04T09:15:00.3365044Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.3365117Z Autotune Choices Stats: 2025-12-04T09:15:00.3366504Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_441", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.3366771Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.3367039Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.3367365Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.3368493Z triton_flex_attention_441 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3369619Z triton_flex_attention_442 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3370730Z triton_flex_attention_439 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.3371907Z triton_flex_attention_437 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3373068Z triton_flex_attention_440 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.3374226Z triton_flex_attention_438 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3374490Z SingleProcess AUTOTUNE benchmarking takes 0.2906 seconds and 1.4377 seconds precompiling for 6 choices 2025-12-04T09:15:00.3374560Z Autotune Choices Stats: 2025-12-04T09:15:00.3376024Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_445", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.3376467Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.3376790Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.3377362Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.3378531Z triton_flex_attention_backward_445 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.3379689Z triton_flex_attention_backward_446 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.3380904Z triton_flex_attention_backward_443 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3382088Z triton_flex_attention_backward_444 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3383242Z triton_flex_attention_backward_447 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3384437Z triton_flex_attention_backward_448 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3385593Z triton_flex_attention_backward_450 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.3386742Z triton_flex_attention_backward_449 0.0135 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.3387889Z triton_flex_attention_backward_452 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3389072Z triton_flex_attention_backward_455 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.3389364Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.3278 seconds precompiling for 13 choices 2025-12-04T09:15:00.3389500Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.3389568Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.3389629Z unimplemented [] 2025-12-04T09:15:00.3389735Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.3389920Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.3391141Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.3391205Z graph_break [] 2025-12-04T09:15:00.3391329Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.3391409Z Autotune Choices Stats: 2025-12-04T09:15:00.3392831Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_460", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:15:00.3393084Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.3393298Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.3393617Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.3394750Z triton_flex_attention_460 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3395870Z triton_flex_attention_461 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3396986Z triton_flex_attention_459 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.3398160Z triton_flex_attention_456 0.0114 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3399305Z triton_flex_attention_458 0.0123 ms 75.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.3400422Z triton_flex_attention_457 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3400671Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.4513 seconds precompiling for 6 choices 2025-12-04T09:15:00.3400742Z Autotune Choices Stats: 2025-12-04T09:15:00.3402192Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_463", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.3402630Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.3402956Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.3403524Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.3404680Z triton_flex_attention_backward_463 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3405900Z triton_flex_attention_backward_465 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.3407054Z triton_flex_attention_backward_462 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3408246Z triton_flex_attention_backward_464 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.3409398Z triton_flex_attention_backward_467 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3410574Z triton_flex_attention_backward_468 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.3411727Z triton_flex_attention_backward_469 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.3412925Z triton_flex_attention_backward_466 0.0133 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3414077Z triton_flex_attention_backward_471 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3415294Z triton_flex_attention_backward_470 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3415537Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.2662 seconds precompiling for 13 choices 2025-12-04T09:15:00.3415674Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.3415744Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.3415857Z unimplemented [] 2025-12-04T09:15:00.3415969Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.3416151Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.3417329Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.3417390Z graph_break [] 2025-12-04T09:15:00.3417517Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.3417591Z Autotune Choices Stats: 2025-12-04T09:15:00.3419007Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_479", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.007968000136315823, "best_triton_pos": 0} 2025-12-04T09:15:00.3419253Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.3419465Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.3419799Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.3420923Z triton_flex_attention_479 0.0080 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3422040Z triton_flex_attention_480 0.0082 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3423209Z triton_flex_attention_477 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.3424357Z triton_flex_attention_478 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.3425467Z triton_flex_attention_475 0.0113 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3426613Z triton_flex_attention_476 0.0133 ms 59.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3426859Z SingleProcess AUTOTUNE benchmarking takes 0.2917 seconds and 1.3266 seconds precompiling for 6 choices 2025-12-04T09:15:00.3426933Z Autotune Choices Stats: 2025-12-04T09:15:00.3428352Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_481", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.3428791Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.3429112Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.3429680Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.3430835Z triton_flex_attention_backward_481 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3432049Z triton_flex_attention_backward_482 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3433236Z triton_flex_attention_backward_483 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.3434385Z triton_flex_attention_backward_484 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.3435641Z triton_flex_attention_backward_485 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3436807Z triton_flex_attention_backward_486 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3437972Z triton_flex_attention_backward_487 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.3439118Z triton_flex_attention_backward_488 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.3440336Z triton_flex_attention_backward_490 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3441526Z triton_flex_attention_backward_493 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.3441774Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2475 seconds precompiling for 13 choices 2025-12-04T09:15:00.3441908Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.3441976Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.3442040Z unimplemented [] 2025-12-04T09:15:00.3442148Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.3442332Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.3443541Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.3443606Z graph_break [] 2025-12-04T09:15:00.3443733Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.3443806Z Autotune Choices Stats: 2025-12-04T09:15:00.3445201Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_498", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.3445463Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.3445686Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.3446012Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.3447133Z triton_flex_attention_498 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3448330Z triton_flex_attention_499 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3449450Z triton_flex_attention_494 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3450598Z triton_flex_attention_496 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.3451708Z triton_flex_attention_497 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.3452924Z triton_flex_attention_495 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3453174Z SingleProcess AUTOTUNE benchmarking takes 0.2916 seconds and 1.3152 seconds precompiling for 6 choices 2025-12-04T09:15:00.3453250Z Autotune Choices Stats: 2025-12-04T09:15:00.3454677Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_500", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:15:00.3455118Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.3455445Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.3456081Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.3457239Z triton_flex_attention_backward_500 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3458433Z triton_flex_attention_backward_501 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3459590Z triton_flex_attention_backward_502 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.3460814Z triton_flex_attention_backward_503 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.3462127Z triton_flex_attention_backward_505 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3463322Z triton_flex_attention_backward_507 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.3464508Z triton_flex_attention_backward_506 0.0143 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.3465784Z triton_flex_attention_backward_504 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3466942Z triton_flex_attention_backward_509 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3468151Z triton_flex_attention_backward_512 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.3468406Z SingleProcess AUTOTUNE benchmarking takes 0.6641 seconds and 2.2733 seconds precompiling for 13 choices 2025-12-04T09:15:00.3468549Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.3468623Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.3468687Z unimplemented [] 2025-12-04T09:15:00.3468801Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.3468985Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.3470209Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 27), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 8), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 5), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.3470281Z graph_break [] 2025-12-04T09:15:00.3470414Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.3470497Z Autotune Choices Stats: 2025-12-04T09:15:00.3471892Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_517", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.3472148Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.3472368Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.3472695Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.3473886Z triton_flex_attention_517 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3475006Z triton_flex_attention_518 0.0084 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3476149Z triton_flex_attention_513 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3477263Z triton_flex_attention_515 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.3478410Z triton_flex_attention_516 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.3479528Z triton_flex_attention_514 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3479776Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3311 seconds precompiling for 6 choices 2025-12-04T09:15:00.3479852Z Autotune Choices Stats: 2025-12-04T09:15:00.3481278Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_519", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:15:00.3481718Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.3482126Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.3482699Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.3483900Z triton_flex_attention_backward_519 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3485064Z triton_flex_attention_backward_520 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3486245Z triton_flex_attention_backward_521 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.3487409Z triton_flex_attention_backward_522 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.3488563Z triton_flex_attention_backward_524 0.0132 ms 93.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3489723Z triton_flex_attention_backward_523 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3490871Z triton_flex_attention_backward_525 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.3492078Z triton_flex_attention_backward_526 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.3493319Z triton_flex_attention_backward_528 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3494478Z triton_flex_attention_backward_531 0.0153 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.3494727Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.2909 seconds precompiling for 13 choices 2025-12-04T09:15:00.3494897Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.3494970Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.3495032Z unimplemented [] 2025-12-04T09:15:00.3495142Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.3495330Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.3500318Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.3500430Z graph_break [] 2025-12-04T09:15:00.3500588Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.3500677Z Autotune Choices Stats: 2025-12-04T09:15:00.3502108Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_536", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.3502369Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.3502698Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.3503025Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.3504170Z triton_flex_attention_536 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3505330Z triton_flex_attention_537 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3506444Z triton_flex_attention_532 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3507598Z triton_flex_attention_534 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.3508726Z triton_flex_attention_535 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.3509866Z triton_flex_attention_533 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3510118Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.2928 seconds precompiling for 6 choices 2025-12-04T09:15:00.3510195Z Autotune Choices Stats: 2025-12-04T09:15:00.3511636Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_538", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:15:00.3512153Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.3512483Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.3513062Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.3514280Z triton_flex_attention_backward_538 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3515445Z triton_flex_attention_backward_539 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3516652Z triton_flex_attention_backward_540 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.3517813Z triton_flex_attention_backward_541 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.3518967Z triton_flex_attention_backward_542 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3520110Z triton_flex_attention_backward_543 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3521322Z triton_flex_attention_backward_545 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.3522500Z triton_flex_attention_backward_544 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.3523653Z triton_flex_attention_backward_547 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3524843Z triton_flex_attention_backward_550 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.3525096Z SingleProcess AUTOTUNE benchmarking takes 0.6681 seconds and 2.3456 seconds precompiling for 13 choices 2025-12-04T09:15:00.3525234Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.3525314Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.3525377Z unimplemented [] 2025-12-04T09:15:00.3525491Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.3525677Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.3526859Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.3526928Z graph_break [] 2025-12-04T09:15:00.3527057Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.3527129Z Autotune Choices Stats: 2025-12-04T09:15:00.3528526Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_555", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.3528855Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.3529077Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.3529400Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.3530580Z triton_flex_attention_555 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3531700Z triton_flex_attention_556 0.0112 ms 91.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3532963Z triton_flex_attention_553 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.3534114Z triton_flex_attention_554 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.3535241Z triton_flex_attention_551 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3536367Z triton_flex_attention_552 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3536613Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.3230 seconds precompiling for 6 choices 2025-12-04T09:15:00.3536720Z Autotune Choices Stats: 2025-12-04T09:15:00.3538200Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_559", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.3538643Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.3538970Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.3539586Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.3540754Z triton_flex_attention_backward_559 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.3544457Z triton_flex_attention_backward_560 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.3545614Z triton_flex_attention_backward_558 0.0142 ms 93.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3546791Z triton_flex_attention_backward_557 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3547964Z triton_flex_attention_backward_562 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3549177Z triton_flex_attention_backward_561 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3550333Z triton_flex_attention_backward_563 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.3551553Z triton_flex_attention_backward_564 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.3552726Z triton_flex_attention_backward_569 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.3553954Z triton_flex_attention_backward_566 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3554208Z SingleProcess AUTOTUNE benchmarking takes 0.6659 seconds and 2.2862 seconds precompiling for 13 choices 2025-12-04T09:15:00.3554351Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.3554429Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.3554492Z unimplemented [] 2025-12-04T09:15:00.3554604Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.3554795Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.3555975Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.3556035Z graph_break [] 2025-12-04T09:15:00.3556168Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.3556278Z Autotune Choices Stats: 2025-12-04T09:15:00.3557679Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_575", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.3557931Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.3558150Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.3558483Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.3559648Z triton_flex_attention_575 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3560774Z triton_flex_attention_574 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3562238Z triton_flex_attention_572 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.3563395Z triton_flex_attention_573 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.3564539Z triton_flex_attention_570 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3565659Z triton_flex_attention_571 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3565968Z SingleProcess AUTOTUNE benchmarking takes 0.2930 seconds and 1.2848 seconds precompiling for 6 choices 2025-12-04T09:15:00.3566040Z Autotune Choices Stats: 2025-12-04T09:15:00.3567482Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_579", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.014271999709308147, "best_triton_pos": 0} 2025-12-04T09:15:00.3567966Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.3568301Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.3568863Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.3570076Z triton_flex_attention_backward_579 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.3571283Z triton_flex_attention_backward_577 0.0143 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3572479Z triton_flex_attention_backward_578 0.0143 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.3573626Z triton_flex_attention_backward_576 0.0145 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3574813Z triton_flex_attention_backward_580 0.0154 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3575967Z triton_flex_attention_backward_581 0.0154 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3577158Z triton_flex_attention_backward_583 0.0155 ms 92.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.3578304Z triton_flex_attention_backward_582 0.0164 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.3579528Z triton_flex_attention_backward_585 0.0164 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3580678Z triton_flex_attention_backward_588 0.0164 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.3580931Z SingleProcess AUTOTUNE benchmarking takes 0.6683 seconds and 2.2637 seconds precompiling for 13 choices 2025-12-04T09:15:00.3581077Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.3581149Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.3581214Z unimplemented [] 2025-12-04T09:15:00.3581326Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.3581514Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.3582695Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.3582816Z graph_break [] 2025-12-04T09:15:00.3582955Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.3583023Z Autotune Choices Stats: 2025-12-04T09:15:00.3584427Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.3584682Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.3584937Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.3585265Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.3586396Z triton_flex_attention_593 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3587588Z triton_flex_attention_594 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3588699Z triton_flex_attention_589 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3589835Z triton_flex_attention_591 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.3590955Z triton_flex_attention_592 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.3592108Z triton_flex_attention_590 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3592362Z SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.3283 seconds precompiling for 6 choices 2025-12-04T09:15:00.3592429Z Autotune Choices Stats: 2025-12-04T09:15:00.3593906Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_595", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:15:00.3594353Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.3594678Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.3595245Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.3596477Z triton_flex_attention_backward_595 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3597637Z triton_flex_attention_backward_596 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3598793Z triton_flex_attention_backward_597 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.3599938Z triton_flex_attention_backward_598 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.3601128Z triton_flex_attention_backward_600 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3602343Z triton_flex_attention_backward_604 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3603496Z triton_flex_attention_backward_599 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3604687Z triton_flex_attention_backward_601 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.3607808Z triton_flex_attention_backward_602 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.3610224Z triton_flex_attention_backward_607 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.3611693Z SingleProcess AUTOTUNE benchmarking takes 0.6683 seconds and 2.3815 seconds precompiling for 13 choices 2025-12-04T09:15:00.3612150Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.3612583Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.3612783Z unimplemented [] 2025-12-04T09:15:00.3612985Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.3613422Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.3614893Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.3616203Z graph_break [] 2025-12-04T09:15:00.3616438Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.3616718Z Autotune Choices Stats: 2025-12-04T09:15:00.3618292Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_612", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.3620006Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.3620550Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.3621164Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.3622735Z triton_flex_attention_612 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3625092Z triton_flex_attention_613 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3627390Z triton_flex_attention_610 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.3629686Z triton_flex_attention_608 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3632021Z triton_flex_attention_611 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.3634311Z triton_flex_attention_609 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3635744Z SingleProcess AUTOTUNE benchmarking takes 0.2948 seconds and 1.3336 seconds precompiling for 6 choices 2025-12-04T09:15:00.3636123Z Autotune Choices Stats: 2025-12-04T09:15:00.3637713Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_614", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:15:00.3639634Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.3640541Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.3641518Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.3643323Z triton_flex_attention_backward_614 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3645703Z triton_flex_attention_backward_615 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3648074Z triton_flex_attention_backward_616 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.3650496Z triton_flex_attention_backward_617 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.3652942Z triton_flex_attention_backward_619 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3655319Z triton_flex_attention_backward_620 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.3657722Z triton_flex_attention_backward_621 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.3660134Z triton_flex_attention_backward_618 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3662690Z triton_flex_attention_backward_623 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3665090Z triton_flex_attention_backward_626 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.3666562Z SingleProcess AUTOTUNE benchmarking takes 0.6671 seconds and 2.3094 seconds precompiling for 13 choices 2025-12-04T09:15:00.3667101Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.3667384Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.3667578Z unimplemented [] 2025-12-04T09:15:00.3667799Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.3668166Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.3669601Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.3670900Z graph_break [] 2025-12-04T09:15:00.3671129Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.3671410Z Autotune Choices Stats: 2025-12-04T09:15:00.3672973Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_631", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.3674692Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.3675224Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.3675948Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.3677483Z triton_flex_attention_631 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3679800Z triton_flex_attention_632 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3682092Z triton_flex_attention_627 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3684392Z triton_flex_attention_629 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.3686730Z triton_flex_attention_630 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.3689070Z triton_flex_attention_628 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3690530Z SingleProcess AUTOTUNE benchmarking takes 0.2939 seconds and 1.3035 seconds precompiling for 6 choices 2025-12-04T09:15:00.3690914Z Autotune Choices Stats: 2025-12-04T09:15:00.3692496Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_634", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.3694543Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.3695380Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.3696345Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.3698162Z triton_flex_attention_backward_634 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3700560Z triton_flex_attention_backward_635 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.3702990Z triton_flex_attention_backward_636 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.3705358Z triton_flex_attention_backward_633 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3707770Z triton_flex_attention_backward_638 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3710149Z triton_flex_attention_backward_640 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.3712596Z triton_flex_attention_backward_637 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3714975Z triton_flex_attention_backward_639 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.3717359Z triton_flex_attention_backward_642 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3719739Z triton_flex_attention_backward_645 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.3721260Z SingleProcess AUTOTUNE benchmarking takes 0.6697 seconds and 2.2461 seconds precompiling for 13 choices 2025-12-04T09:15:00.3721713Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.3721991Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.3722175Z unimplemented [] 2025-12-04T09:15:00.3722379Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.3722738Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.3724211Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.3725508Z graph_break [] 2025-12-04T09:15:00.3725727Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.3726002Z Autotune Choices Stats: 2025-12-04T09:15:00.3727503Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_650", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.3729260Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.3729793Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.3730403Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.3731926Z triton_flex_attention_650 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3734317Z triton_flex_attention_651 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3736619Z triton_flex_attention_648 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.3738969Z triton_flex_attention_646 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3741327Z triton_flex_attention_649 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.3743626Z triton_flex_attention_647 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3745061Z SingleProcess AUTOTUNE benchmarking takes 0.2938 seconds and 1.3188 seconds precompiling for 6 choices 2025-12-04T09:15:00.3745445Z Autotune Choices Stats: 2025-12-04T09:15:00.3747034Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_654", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.3748991Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.3749820Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.3750781Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.3752580Z triton_flex_attention_backward_654 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.3754962Z triton_flex_attention_backward_652 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3757375Z triton_flex_attention_backward_653 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3759810Z triton_flex_attention_backward_655 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.3762349Z triton_flex_attention_backward_657 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3765218Z triton_flex_attention_backward_659 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.3767618Z triton_flex_attention_backward_658 0.0156 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.3770020Z triton_flex_attention_backward_656 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3772428Z triton_flex_attention_backward_661 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3774870Z triton_flex_attention_backward_664 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.3776354Z SingleProcess AUTOTUNE benchmarking takes 0.6665 seconds and 2.3325 seconds precompiling for 13 choices 2025-12-04T09:15:00.3776811Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.3777093Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.3777287Z unimplemented [] 2025-12-04T09:15:00.3777486Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.3777916Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.3779367Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.3780667Z graph_break [] 2025-12-04T09:15:00.3780889Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.3781162Z Autotune Choices Stats: 2025-12-04T09:15:00.3782697Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_669", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.3784441Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.3784997Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.3785608Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.3787131Z triton_flex_attention_669 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3789438Z triton_flex_attention_670 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3791787Z triton_flex_attention_667 0.0123 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.3794104Z triton_flex_attention_665 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3796454Z triton_flex_attention_668 0.0135 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.3798757Z triton_flex_attention_666 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3800252Z SingleProcess AUTOTUNE benchmarking takes 0.2949 seconds and 1.4095 seconds precompiling for 6 choices 2025-12-04T09:15:00.3800638Z Autotune Choices Stats: 2025-12-04T09:15:00.3802183Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_673", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012512000277638435, "best_triton_pos": 0} 2025-12-04T09:15:00.3804104Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.3804937Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.3805896Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.3807695Z triton_flex_attention_backward_673 0.0125 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.3810122Z triton_flex_attention_backward_672 0.0133 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3812577Z triton_flex_attention_backward_674 0.0133 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.3814963Z triton_flex_attention_backward_671 0.0143 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3817362Z triton_flex_attention_backward_676 0.0143 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3819766Z triton_flex_attention_backward_675 0.0152 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3822145Z triton_flex_attention_backward_677 0.0154 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.3824536Z triton_flex_attention_backward_678 0.0154 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.3826956Z triton_flex_attention_backward_680 0.0164 ms 76.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3829323Z triton_flex_attention_backward_683 0.0164 ms 76.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.3830792Z SingleProcess AUTOTUNE benchmarking takes 0.6665 seconds and 2.3098 seconds precompiling for 13 choices 2025-12-04T09:15:00.3831279Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.3831559Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.3831757Z unimplemented [] 2025-12-04T09:15:00.3831963Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.3832323Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.3833777Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.3835114Z graph_break [] 2025-12-04T09:15:00.3835374Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.3835661Z Autotune Choices Stats: 2025-12-04T09:15:00.3837162Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_688", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.3838855Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.3839402Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.3840017Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.3841542Z triton_flex_attention_688 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3843847Z triton_flex_attention_689 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3846202Z triton_flex_attention_686 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.3848541Z triton_flex_attention_684 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3850866Z triton_flex_attention_687 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.3853278Z triton_flex_attention_685 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3854708Z SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.3382 seconds precompiling for 6 choices 2025-12-04T09:15:00.3855094Z Autotune Choices Stats: 2025-12-04T09:15:00.3856628Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_693", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.3858545Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.3859380Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.3860344Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.3862362Z triton_flex_attention_backward_693 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.3864755Z triton_flex_attention_backward_692 0.0135 ms 98.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.3867214Z triton_flex_attention_backward_691 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3869644Z triton_flex_attention_backward_695 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3872076Z triton_flex_attention_backward_690 0.0153 ms 86.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3874472Z triton_flex_attention_backward_696 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.3876852Z triton_flex_attention_backward_697 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.3879223Z triton_flex_attention_backward_699 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3881691Z triton_flex_attention_backward_694 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3884093Z triton_flex_attention_backward_702 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.3885581Z SingleProcess AUTOTUNE benchmarking takes 0.6670 seconds and 2.3646 seconds precompiling for 13 choices 2025-12-04T09:15:00.3886037Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.3886320Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.3886507Z unimplemented [] 2025-12-04T09:15:00.3886711Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.3887072Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.3888619Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.3889918Z graph_break [] 2025-12-04T09:15:00.3890136Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.3890409Z Autotune Choices Stats: 2025-12-04T09:15:00.3891913Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_707", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.3893651Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.3894186Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.3894793Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.3896334Z triton_flex_attention_707 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3898705Z triton_flex_attention_708 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3901043Z triton_flex_attention_703 0.0134 ms 76.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3903349Z triton_flex_attention_705 0.0135 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.3905680Z triton_flex_attention_706 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.3908017Z triton_flex_attention_704 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3909455Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3640 seconds precompiling for 6 choices 2025-12-04T09:15:00.3909844Z Autotune Choices Stats: 2025-12-04T09:15:00.3911387Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_712", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.3913303Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.3914133Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.3915135Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.3916941Z triton_flex_attention_backward_712 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.3919362Z triton_flex_attention_backward_710 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3921749Z triton_flex_attention_backward_711 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.3924186Z triton_flex_attention_backward_709 0.0153 ms 86.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3926555Z triton_flex_attention_backward_714 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3928933Z triton_flex_attention_backward_716 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.3931309Z triton_flex_attention_backward_713 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3933764Z triton_flex_attention_backward_715 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.3936169Z triton_flex_attention_backward_718 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3938548Z triton_flex_attention_backward_721 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.3940024Z SingleProcess AUTOTUNE benchmarking takes 0.6684 seconds and 2.3297 seconds precompiling for 13 choices 2025-12-04T09:15:00.3940470Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.3940786Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.3940980Z unimplemented [] 2025-12-04T09:15:00.3941209Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.3941582Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.3943024Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.3944326Z graph_break [] 2025-12-04T09:15:00.3944549Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.3944821Z Autotune Choices Stats: 2025-12-04T09:15:00.3946323Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_726", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.3948006Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.3948543Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.3949208Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.3950735Z triton_flex_attention_726 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3953047Z triton_flex_attention_727 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3955399Z triton_flex_attention_724 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.3957749Z triton_flex_attention_722 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3960096Z triton_flex_attention_725 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.3962576Z triton_flex_attention_723 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3964023Z SingleProcess AUTOTUNE benchmarking takes 0.2946 seconds and 1.3261 seconds precompiling for 6 choices 2025-12-04T09:15:00.3964405Z Autotune Choices Stats: 2025-12-04T09:15:00.3965943Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_730", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.3967952Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.3968785Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.3969755Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.3971617Z triton_flex_attention_backward_730 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.3974073Z triton_flex_attention_backward_729 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3976509Z triton_flex_attention_backward_731 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.3978931Z triton_flex_attention_backward_728 0.0152 ms 87.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3981307Z triton_flex_attention_backward_733 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3983667Z triton_flex_attention_backward_732 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3986087Z triton_flex_attention_backward_734 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.3988453Z triton_flex_attention_backward_735 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.3990849Z triton_flex_attention_backward_737 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.3993217Z triton_flex_attention_backward_736 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.3994758Z SingleProcess AUTOTUNE benchmarking takes 0.6686 seconds and 2.3200 seconds precompiling for 13 choices 2025-12-04T09:15:00.3995212Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.3995486Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.3995679Z unimplemented [] 2025-12-04T09:15:00.3995883Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.3996244Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.3997678Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.3998978Z graph_break [] 2025-12-04T09:15:00.3999199Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.3999478Z Autotune Choices Stats: 2025-12-04T09:15:00.4000984Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_745", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.4002720Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4003250Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.4003856Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.4005391Z triton_flex_attention_745 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4007735Z triton_flex_attention_746 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4010041Z triton_flex_attention_743 0.0142 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.4012489Z triton_flex_attention_741 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4014814Z triton_flex_attention_744 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.4017124Z triton_flex_attention_742 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4018555Z SingleProcess AUTOTUNE benchmarking takes 0.2945 seconds and 1.2852 seconds precompiling for 6 choices 2025-12-04T09:15:00.4018938Z Autotune Choices Stats: 2025-12-04T09:15:00.4020480Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_748", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.4022444Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4023268Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.4024282Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.4026098Z triton_flex_attention_backward_748 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4028510Z triton_flex_attention_backward_749 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.4030927Z triton_flex_attention_backward_747 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4033298Z triton_flex_attention_backward_750 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.4035697Z triton_flex_attention_backward_752 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4038064Z triton_flex_attention_backward_751 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4040480Z triton_flex_attention_backward_753 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.4042881Z triton_flex_attention_backward_754 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.4045254Z triton_flex_attention_backward_756 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4047679Z triton_flex_attention_backward_759 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.4049151Z SingleProcess AUTOTUNE benchmarking takes 0.6747 seconds and 2.3463 seconds precompiling for 13 choices 2025-12-04T09:15:00.4049630Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.4049907Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.4050097Z unimplemented [] 2025-12-04T09:15:00.4050294Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.4050654Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.4052113Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.4053463Z graph_break [] 2025-12-04T09:15:00.4053681Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.4053958Z Autotune Choices Stats: 2025-12-04T09:15:00.4055458Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_764", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.4057211Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4057748Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.4058350Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.4059916Z triton_flex_attention_764 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4062362Z triton_flex_attention_765 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4064735Z triton_flex_attention_762 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.4067100Z triton_flex_attention_760 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4069425Z triton_flex_attention_763 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.4071714Z triton_flex_attention_761 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4073200Z SingleProcess AUTOTUNE benchmarking takes 0.2940 seconds and 1.3826 seconds precompiling for 6 choices 2025-12-04T09:15:00.4073587Z Autotune Choices Stats: 2025-12-04T09:15:00.4075121Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_769", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013376000337302685, "best_triton_pos": 0} 2025-12-04T09:15:00.4077036Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4077936Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.4078907Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.4080712Z triton_flex_attention_backward_769 0.0134 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.4083161Z triton_flex_attention_backward_767 0.0143 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4085530Z triton_flex_attention_backward_768 0.0143 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.4087895Z triton_flex_attention_backward_766 0.0154 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4090271Z triton_flex_attention_backward_770 0.0154 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4092761Z triton_flex_attention_backward_771 0.0154 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4095168Z triton_flex_attention_backward_772 0.0154 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.4097545Z triton_flex_attention_backward_773 0.0154 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.4099940Z triton_flex_attention_backward_775 0.0164 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4102352Z triton_flex_attention_backward_778 0.0164 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.4103826Z SingleProcess AUTOTUNE benchmarking takes 0.6673 seconds and 2.4135 seconds precompiling for 13 choices 2025-12-04T09:15:00.4104279Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.4104567Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.4104763Z unimplemented [] 2025-12-04T09:15:00.4104966Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.4105323Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.4106782Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.4108121Z graph_break [] 2025-12-04T09:15:00.4108347Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.4108618Z Autotune Choices Stats: 2025-12-04T09:15:00.4110116Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_783", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.4111807Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4112346Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.4112991Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.4114535Z triton_flex_attention_783 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4116862Z triton_flex_attention_784 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4119189Z triton_flex_attention_781 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.4121477Z triton_flex_attention_779 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4123769Z triton_flex_attention_782 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.4126070Z triton_flex_attention_780 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4127531Z SingleProcess AUTOTUNE benchmarking takes 0.2940 seconds and 1.3371 seconds precompiling for 6 choices 2025-12-04T09:15:00.4127919Z Autotune Choices Stats: 2025-12-04T09:15:00.4129496Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_787", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.4131415Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4132238Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.4133251Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.4135130Z triton_flex_attention_backward_787 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.4137504Z triton_flex_attention_backward_785 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4139880Z triton_flex_attention_backward_786 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4142240Z triton_flex_attention_backward_788 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.4144651Z triton_flex_attention_backward_790 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4147031Z triton_flex_attention_backward_791 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.4149440Z triton_flex_attention_backward_789 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4151832Z triton_flex_attention_backward_792 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.4154293Z triton_flex_attention_backward_794 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4156663Z triton_flex_attention_backward_797 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.4158129Z SingleProcess AUTOTUNE benchmarking takes 0.6666 seconds and 2.3976 seconds precompiling for 13 choices 2025-12-04T09:15:00.4158579Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.4158857Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.4159042Z unimplemented [] 2025-12-04T09:15:00.4159242Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.4159610Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.4161042Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.4162585Z graph_break [] 2025-12-04T09:15:00.4162809Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.4163084Z Autotune Choices Stats: 2025-12-04T09:15:00.4164654Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_803", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.4166365Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4166897Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.4167522Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.4169037Z triton_flex_attention_803 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4171449Z triton_flex_attention_802 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4173815Z triton_flex_attention_801 0.0143 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.4176118Z triton_flex_attention_798 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4178405Z triton_flex_attention_800 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.4180763Z triton_flex_attention_799 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4182197Z SingleProcess AUTOTUNE benchmarking takes 0.2967 seconds and 1.3426 seconds precompiling for 6 choices 2025-12-04T09:15:00.4182582Z Autotune Choices Stats: 2025-12-04T09:15:00.4184166Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_806", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.4186081Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4186902Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.4187929Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.4189732Z triton_flex_attention_backward_806 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.4192111Z triton_flex_attention_backward_807 0.0142 ms 93.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.4194485Z triton_flex_attention_backward_804 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4196888Z triton_flex_attention_backward_805 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4199265Z triton_flex_attention_backward_809 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4201670Z triton_flex_attention_backward_810 0.0163 ms 81.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.4204033Z triton_flex_attention_backward_813 0.0163 ms 81.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4206460Z triton_flex_attention_backward_808 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4208831Z triton_flex_attention_backward_811 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.4211210Z triton_flex_attention_backward_816 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.4212732Z SingleProcess AUTOTUNE benchmarking takes 0.6691 seconds and 2.4016 seconds precompiling for 13 choices 2025-12-04T09:15:00.4213188Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.4213463Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.4213702Z unimplemented [] 2025-12-04T09:15:00.4213899Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.4214263Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.4215697Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.4216993Z graph_break [] 2025-12-04T09:15:00.4217211Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.4217489Z Autotune Choices Stats: 2025-12-04T09:15:00.4219027Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_821", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.4220724Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4221257Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.4221862Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.4223475Z triton_flex_attention_821 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4225787Z triton_flex_attention_822 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4228093Z triton_flex_attention_819 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.4230395Z triton_flex_attention_820 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.4232724Z triton_flex_attention_817 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4235017Z triton_flex_attention_818 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4236486Z SingleProcess AUTOTUNE benchmarking takes 0.2943 seconds and 1.3304 seconds precompiling for 6 choices 2025-12-04T09:15:00.4236874Z Autotune Choices Stats: 2025-12-04T09:15:00.4238412Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_826", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.4240372Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4241234Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.4242198Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.4244007Z triton_flex_attention_backward_826 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.4246384Z triton_flex_attention_backward_823 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4248756Z triton_flex_attention_backward_824 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4251163Z triton_flex_attention_backward_825 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.4253688Z triton_flex_attention_backward_828 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4256066Z triton_flex_attention_backward_829 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.4258463Z triton_flex_attention_backward_830 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.4260859Z triton_flex_attention_backward_827 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4263409Z triton_flex_attention_backward_835 0.0173 ms 76.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.4265774Z triton_flex_attention_backward_831 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4267311Z SingleProcess AUTOTUNE benchmarking takes 0.6677 seconds and 2.3188 seconds precompiling for 13 choices 2025-12-04T09:15:00.4267767Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.4268041Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.4268232Z unimplemented [] 2025-12-04T09:15:00.4268436Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.4268788Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.4270225Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.4271521Z graph_break [] 2025-12-04T09:15:00.4271797Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.4272077Z Autotune Choices Stats: 2025-12-04T09:15:00.4273586Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_840", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.4275327Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4275911Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.4276525Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.4278054Z triton_flex_attention_840 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4280356Z triton_flex_attention_841 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4282652Z triton_flex_attention_838 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.4284983Z triton_flex_attention_836 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4287277Z triton_flex_attention_839 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.4289653Z triton_flex_attention_837 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4291089Z SingleProcess AUTOTUNE benchmarking takes 0.2942 seconds and 1.3626 seconds precompiling for 6 choices 2025-12-04T09:15:00.4291469Z Autotune Choices Stats: 2025-12-04T09:15:00.4293091Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_844", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.4295050Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4295899Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.4296862Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.4298667Z triton_flex_attention_backward_844 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.4301054Z triton_flex_attention_backward_845 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.4303470Z triton_flex_attention_backward_842 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4305882Z triton_flex_attention_backward_843 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4308256Z triton_flex_attention_backward_846 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4310653Z triton_flex_attention_backward_847 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4313059Z triton_flex_attention_backward_848 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.4315431Z triton_flex_attention_backward_851 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4317796Z triton_flex_attention_backward_849 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.4320163Z triton_flex_attention_backward_854 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.4321661Z SingleProcess AUTOTUNE benchmarking takes 0.6694 seconds and 2.3460 seconds precompiling for 13 choices 2025-12-04T09:15:00.4322109Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.4322384Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.4322569Z unimplemented [] 2025-12-04T09:15:00.4322766Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.4323139Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.4324619Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.4325904Z graph_break [] 2025-12-04T09:15:00.4326120Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.4326394Z Autotune Choices Stats: 2025-12-04T09:15:00.4327925Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_859", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.4329642Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4330180Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.4330792Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.4332368Z triton_flex_attention_859 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4334690Z triton_flex_attention_860 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4336988Z triton_flex_attention_855 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4339322Z triton_flex_attention_857 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.4341664Z triton_flex_attention_858 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.4343968Z triton_flex_attention_856 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4345402Z SingleProcess AUTOTUNE benchmarking takes 0.2942 seconds and 1.3732 seconds precompiling for 6 choices 2025-12-04T09:15:00.4345833Z Autotune Choices Stats: 2025-12-04T09:15:00.4347397Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_862", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.4349331Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4350167Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.4351127Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.4352931Z triton_flex_attention_backward_862 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4355370Z triton_flex_attention_backward_864 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.4357740Z triton_flex_attention_backward_861 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4360152Z triton_flex_attention_backward_863 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.4362703Z triton_flex_attention_backward_866 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4363988Z triton_flex_attention_backward_865 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4365149Z triton_flex_attention_backward_867 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.4366301Z triton_flex_attention_backward_868 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.4367451Z triton_flex_attention_backward_870 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4368647Z triton_flex_attention_backward_873 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.4368900Z SingleProcess AUTOTUNE benchmarking takes 0.6688 seconds and 2.4237 seconds precompiling for 13 choices 2025-12-04T09:15:00.4369081Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:15:00.4369175Z Traceback (most recent call last): 2025-12-04T09:15:00.4369532Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:15:00.4369602Z self.assertTrue( 2025-12-04T09:15:00.4369810Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:15:00.4369892Z raise self.failureException(msg) 2025-12-04T09:15:00.4370142Z AssertionError: False is not true : Log file /tmp/tmp8f2jgdv8/flex_attention_configs.json was not created 2025-12-04T09:15:00.4370147Z 2025-12-04T09:15:00.4370284Z To execute this test, run the following from the base repo dir: 2025-12-04T09:15:00.4370542Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:15:00.4370547Z 2025-12-04T09:15:00.4370719Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:15:00.4370904Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.4370977Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.4371084Z unimplemented [] 2025-12-04T09:15:00.4371191Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.4372417Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('async_compile_cache_miss', 23), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:15:00.4372613Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.4372680Z graph_break [] 2025-12-04T09:15:00.4372814Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.4373799Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:15:00.4373887Z current_size = base.storage().size() 2025-12-04T09:15:00.4373954Z Autotune Choices Stats: 2025-12-04T09:15:00.4375355Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.4375639Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4375864Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.4376180Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.4377334Z triton_flex_attention_4 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4378453Z triton_flex_attention_5 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4379612Z triton_flex_attention_0 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4380751Z triton_flex_attention_2 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.4381865Z triton_flex_attention_3 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.4382975Z triton_flex_attention_1 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4383224Z SingleProcess AUTOTUNE benchmarking takes 0.2600 seconds and 1.4244 seconds precompiling for 6 choices 2025-12-04T09:15:00.4383289Z Autotune Choices Stats: 2025-12-04T09:15:00.4384758Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.4385203Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4385536Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.4386137Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.4387304Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4388486Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.4389684Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4390840Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.4391980Z triton_flex_attention_backward_11 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4393164Z triton_flex_attention_backward_12 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.4394303Z triton_flex_attention_backward_13 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.4395484Z triton_flex_attention_backward_10 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4396631Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4397889Z triton_flex_attention_backward_15 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4398140Z SingleProcess AUTOTUNE benchmarking takes 0.8717 seconds and 2.2315 seconds precompiling for 13 choices 2025-12-04T09:15:00.4398270Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.4398341Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.4398413Z unimplemented [] 2025-12-04T09:15:00.4398516Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.4398703Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.4399882Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.4399944Z graph_break [] 2025-12-04T09:15:00.4400081Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.4400149Z Autotune Choices Stats: 2025-12-04T09:15:00.4401577Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.4401819Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4402042Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.4402360Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.4403516Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4404631Z triton_flex_attention_24 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4405808Z triton_flex_attention_19 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4406921Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.4408044Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.4409156Z triton_flex_attention_20 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4409440Z SingleProcess AUTOTUNE benchmarking takes 0.2898 seconds and 1.2778 seconds precompiling for 6 choices 2025-12-04T09:15:00.4409506Z Autotune Choices Stats: 2025-12-04T09:15:00.4410937Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.4411370Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4411740Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.4412344Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.4413564Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4414752Z triton_flex_attention_backward_27 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.4415907Z triton_flex_attention_backward_28 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.4417054Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4418195Z triton_flex_attention_backward_30 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4419385Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4420566Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.4421736Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.4422968Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4424117Z triton_flex_attention_backward_37 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.4424368Z SingleProcess AUTOTUNE benchmarking takes 0.6605 seconds and 2.3199 seconds precompiling for 13 choices 2025-12-04T09:15:00.4424500Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.4424571Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.4424640Z unimplemented [] 2025-12-04T09:15:00.4424742Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.4424925Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.4426108Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.4426218Z graph_break [] 2025-12-04T09:15:00.4426355Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.4426423Z Autotune Choices Stats: 2025-12-04T09:15:00.4427812Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.4428055Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4428313Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.4428628Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.4429748Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4430910Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4432058Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.4433180Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.4434289Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4435430Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4435677Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2616 seconds precompiling for 6 choices 2025-12-04T09:15:00.4435747Z Autotune Choices Stats: 2025-12-04T09:15:00.4437201Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.4437647Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4437980Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.4438539Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.4439780Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4440930Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.4442092Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.4443236Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4444417Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4445594Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4446736Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.4447916Z triton_flex_attention_backward_50 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.4449088Z triton_flex_attention_backward_53 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4450237Z triton_flex_attention_backward_56 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.4450487Z SingleProcess AUTOTUNE benchmarking takes 0.6631 seconds and 2.2635 seconds precompiling for 13 choices 2025-12-04T09:15:00.4450616Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.4450684Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.4450752Z unimplemented [] 2025-12-04T09:15:00.4450854Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.4451035Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.4452287Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.4452387Z graph_break [] 2025-12-04T09:15:00.4452531Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.4452599Z Autotune Choices Stats: 2025-12-04T09:15:00.4454043Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.4454291Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4454507Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.4454827Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.4455984Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4457131Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4458250Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.4459359Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.4460471Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4461625Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4462040Z SingleProcess AUTOTUNE benchmarking takes 0.2907 seconds and 1.2500 seconds precompiling for 6 choices 2025-12-04T09:15:00.4462115Z Autotune Choices Stats: 2025-12-04T09:15:00.4463622Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.4464057Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4464444Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.4465059Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.4466231Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4467385Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4468544Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.4469747Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.4470895Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4472078Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4473220Z triton_flex_attention_backward_69 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.4474435Z triton_flex_attention_backward_70 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.4475576Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4476727Z triton_flex_attention_backward_75 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.4476971Z SingleProcess AUTOTUNE benchmarking takes 0.6601 seconds and 2.1469 seconds precompiling for 13 choices 2025-12-04T09:15:00.4477125Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.4477235Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.4477306Z unimplemented [] 2025-12-04T09:15:00.4477412Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.4477595Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.4478784Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.4478845Z graph_break [] 2025-12-04T09:15:00.4478979Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.4484980Z Autotune Choices Stats: 2025-12-04T09:15:00.4486560Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.4486850Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4487079Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.4487456Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.4488638Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4489770Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4490901Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.4492044Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.4493329Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4494476Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4494742Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2576 seconds precompiling for 6 choices 2025-12-04T09:15:00.4494814Z Autotune Choices Stats: 2025-12-04T09:15:00.4496256Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_84", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.4496793Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4497135Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.4497701Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.4498873Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.4500037Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.4501194Z triton_flex_attention_backward_83 0.0122 ms 92.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4502372Z triton_flex_attention_backward_82 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4503552Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4504690Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4505901Z triton_flex_attention_backward_88 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.4507044Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.4508187Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4509339Z triton_flex_attention_backward_91 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4509628Z SingleProcess AUTOTUNE benchmarking takes 0.6599 seconds and 2.2799 seconds precompiling for 13 choices 2025-12-04T09:15:00.4509774Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.4509847Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.4509913Z unimplemented [] 2025-12-04T09:15:00.4510042Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.4510236Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.4511456Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.4511526Z graph_break [] 2025-12-04T09:15:00.4511667Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.4511738Z Autotune Choices Stats: 2025-12-04T09:15:00.4513140Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:15:00.4513429Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4513684Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.4514008Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.4515139Z triton_flex_attention_99 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4516260Z triton_flex_attention_100 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4517370Z triton_flex_attention_95 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4518518Z triton_flex_attention_98 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.4519634Z triton_flex_attention_97 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.4520785Z triton_flex_attention_96 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4521043Z SingleProcess AUTOTUNE benchmarking takes 0.2905 seconds and 1.2994 seconds precompiling for 6 choices 2025-12-04T09:15:00.4521111Z Autotune Choices Stats: 2025-12-04T09:15:00.4522809Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_102", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.4523315Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4523671Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.4524243Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.4525438Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4526606Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.4527804Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.4528995Z triton_flex_attention_backward_101 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4530162Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4531340Z triton_flex_attention_backward_107 0.0133 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.4532583Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4533743Z triton_flex_attention_backward_108 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.4534894Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4536088Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4536342Z SingleProcess AUTOTUNE benchmarking takes 0.6615 seconds and 2.2434 seconds precompiling for 13 choices 2025-12-04T09:15:00.4536487Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.4536559Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.4536625Z unimplemented [] 2025-12-04T09:15:00.4536740Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.4536932Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.4538202Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.4538278Z graph_break [] 2025-12-04T09:15:00.4538411Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.4538486Z Autotune Choices Stats: 2025-12-04T09:15:00.4539918Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.4540207Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4540428Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.4540750Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.4541895Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4543077Z triton_flex_attention_119 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4544195Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.4545355Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.4546504Z triton_flex_attention_114 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4547617Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4547923Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2587 seconds precompiling for 6 choices 2025-12-04T09:15:00.4547991Z Autotune Choices Stats: 2025-12-04T09:15:00.4549443Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:15:00.4549882Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4550213Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.4550780Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.4551938Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4553139Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4554294Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.4555480Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.4556684Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4557864Z triton_flex_attention_backward_124 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4559019Z triton_flex_attention_backward_126 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.4560177Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.4561326Z triton_flex_attention_backward_129 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4562745Z triton_flex_attention_backward_132 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.4563008Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.3330 seconds precompiling for 13 choices 2025-12-04T09:15:00.4563156Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.4563333Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.4563410Z unimplemented [] 2025-12-04T09:15:00.4563528Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.4563716Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.4564936Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.4565060Z graph_break [] 2025-12-04T09:15:00.4565190Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.4565263Z Autotune Choices Stats: 2025-12-04T09:15:00.4566712Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:15:00.4566964Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4567191Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.4567523Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.4568658Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4569796Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4570963Z triton_flex_attention_136 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.4572113Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.4573296Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4574448Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4574732Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3488 seconds precompiling for 6 choices 2025-12-04T09:15:00.4574801Z Autotune Choices Stats: 2025-12-04T09:15:00.4576229Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_140", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.4576671Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4576995Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.4577563Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.4578765Z triton_flex_attention_backward_140 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4579925Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.4581115Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4582267Z triton_flex_attention_backward_142 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.4583487Z triton_flex_attention_backward_144 0.0131 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4584640Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4585801Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.4586953Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.4588130Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4589334Z triton_flex_attention_backward_149 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.4589582Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.2392 seconds precompiling for 13 choices 2025-12-04T09:15:00.4589721Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.4589791Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.4589853Z unimplemented [] 2025-12-04T09:15:00.4589962Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.4590145Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.4591383Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.4591482Z graph_break [] 2025-12-04T09:15:00.4591611Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.4591686Z Autotune Choices Stats: 2025-12-04T09:15:00.4593075Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.4593331Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4593551Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.4593879Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.4595015Z triton_flex_attention_156 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4596172Z triton_flex_attention_157 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4597287Z triton_flex_attention_152 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4598447Z triton_flex_attention_154 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.4599591Z triton_flex_attention_155 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.4600740Z triton_flex_attention_153 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4600992Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3309 seconds precompiling for 6 choices 2025-12-04T09:15:00.4601059Z Autotune Choices Stats: 2025-12-04T09:15:00.4602487Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_160", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.4602927Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4603250Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.4603854Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.4605026Z triton_flex_attention_backward_160 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.4606216Z triton_flex_attention_backward_158 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4607372Z triton_flex_attention_backward_159 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4608549Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.4609734Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4610882Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4612030Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.4613255Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.4614402Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4615605Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.4615855Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.1546 seconds precompiling for 13 choices 2025-12-04T09:15:00.4615994Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.4616064Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.4616132Z unimplemented [] 2025-12-04T09:15:00.4616276Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.4616490Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.4617676Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.4617738Z graph_break [] 2025-12-04T09:15:00.4617865Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.4617941Z Autotune Choices Stats: 2025-12-04T09:15:00.4619321Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.4619576Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4619794Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.4620118Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.4621302Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4622420Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4623571Z triton_flex_attention_171 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4624694Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.4625870Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.4626989Z triton_flex_attention_172 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4627247Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3199 seconds precompiling for 6 choices 2025-12-04T09:15:00.4627313Z Autotune Choices Stats: 2025-12-04T09:15:00.4628735Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_178", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.4629205Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4629533Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.4630101Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.4631302Z triton_flex_attention_backward_178 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4632476Z triton_flex_attention_backward_177 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4633664Z triton_flex_attention_backward_180 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.4634861Z triton_flex_attention_backward_179 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.4636018Z triton_flex_attention_backward_182 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4637169Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4638329Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.4639519Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.4640700Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4641852Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.4642131Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3449 seconds precompiling for 13 choices 2025-12-04T09:15:00.4642300Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.4642374Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.4642436Z unimplemented [] 2025-12-04T09:15:00.4642547Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.4642730Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.4643917Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.4643981Z graph_break [] 2025-12-04T09:15:00.4644111Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.4644184Z Autotune Choices Stats: 2025-12-04T09:15:00.4645564Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:15:00.4645859Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4646081Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.4646402Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.4647528Z triton_flex_attention_194 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4648682Z triton_flex_attention_195 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4649813Z triton_flex_attention_192 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.4650963Z triton_flex_attention_193 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.4652124Z triton_flex_attention_190 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4653309Z triton_flex_attention_191 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4653554Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3094 seconds precompiling for 6 choices 2025-12-04T09:15:00.4653626Z Autotune Choices Stats: 2025-12-04T09:15:00.4655055Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.4655532Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4655856Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.4656421Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.4657627Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4658792Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.4660030Z triton_flex_attention_backward_197 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4661183Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.4662510Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4663661Z triton_flex_attention_backward_200 0.0124 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4664913Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.4666108Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.4667273Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.4668474Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4668766Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.3930 seconds precompiling for 13 choices 2025-12-04T09:15:00.4668907Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.4668978Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.4669042Z unimplemented [] 2025-12-04T09:15:00.4669151Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.4669336Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.4670524Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.4670587Z graph_break [] 2025-12-04T09:15:00.4670723Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.4670809Z Autotune Choices Stats: 2025-12-04T09:15:00.4672199Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008063999935984612, "best_triton_pos": 0} 2025-12-04T09:15:00.4672536Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4672757Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.4673076Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.4674239Z triton_flex_attention_213 0.0081 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4675364Z triton_flex_attention_214 0.0082 ms 98.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4676515Z triton_flex_attention_211 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.4677672Z triton_flex_attention_212 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.4678786Z triton_flex_attention_209 0.0112 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4679907Z triton_flex_attention_210 0.0133 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4680151Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.3148 seconds precompiling for 6 choices 2025-12-04T09:15:00.4680259Z Autotune Choices Stats: 2025-12-04T09:15:00.4681684Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.4682122Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4682477Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.4683049Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.4684210Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4685455Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4686618Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.4687775Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.4688954Z triton_flex_attention_backward_219 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4690137Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4691331Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.4692532Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.4693722Z triton_flex_attention_backward_224 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4694911Z triton_flex_attention_backward_227 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.4695155Z SingleProcess AUTOTUNE benchmarking takes 0.6619 seconds and 2.3746 seconds precompiling for 13 choices 2025-12-04T09:15:00.4695300Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.4695376Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.4695441Z unimplemented [] 2025-12-04T09:15:00.4695551Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.4695733Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.4696903Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.4696969Z graph_break [] 2025-12-04T09:15:00.4697095Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.4697210Z Autotune Choices Stats: 2025-12-04T09:15:00.4698593Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.4698836Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4699052Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.4699372Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.4700534Z triton_flex_attention_232 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4701658Z triton_flex_attention_233 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4702833Z triton_flex_attention_228 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4703956Z triton_flex_attention_230 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.4705073Z triton_flex_attention_231 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.4706192Z triton_flex_attention_229 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4706467Z SingleProcess AUTOTUNE benchmarking takes 0.2900 seconds and 1.2906 seconds precompiling for 6 choices 2025-12-04T09:15:00.4706541Z Autotune Choices Stats: 2025-12-04T09:15:00.4707968Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:15:00.4708448Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4708778Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.4709345Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.4710542Z triton_flex_attention_backward_234 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4711740Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4712896Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.4714053Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.4715247Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4716397Z triton_flex_attention_backward_240 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.4717609Z triton_flex_attention_backward_241 0.0134 ms 91.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.4718759Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4719980Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4721137Z triton_flex_attention_backward_246 0.0153 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.4721385Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2768 seconds precompiling for 13 choices 2025-12-04T09:15:00.4721519Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.4721587Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.4721660Z unimplemented [] 2025-12-04T09:15:00.4721769Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.4721952Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.4723128Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.4723234Z graph_break [] 2025-12-04T09:15:00.4723362Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.4723434Z Autotune Choices Stats: 2025-12-04T09:15:00.4724810Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.4725093Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4725312Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.4725628Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.4726770Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4727967Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4729088Z triton_flex_attention_250 0.0111 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.4730213Z triton_flex_attention_247 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4731336Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.4732547Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4732795Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2952 seconds precompiling for 6 choices 2025-12-04T09:15:00.4732868Z Autotune Choices Stats: 2025-12-04T09:15:00.4734333Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_255", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.4734776Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4735103Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.4735740Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.4736899Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.4738060Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4739210Z triton_flex_attention_backward_254 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4740365Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.4741555Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4742747Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4743902Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.4745087Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.4746276Z triton_flex_attention_backward_262 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4747440Z triton_flex_attention_backward_265 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.4747686Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.3139 seconds precompiling for 13 choices 2025-12-04T09:15:00.4747813Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.4747888Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.4747954Z unimplemented [] 2025-12-04T09:15:00.4748066Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.4748303Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.4749475Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.4749549Z graph_break [] 2025-12-04T09:15:00.4749679Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.4749754Z Autotune Choices Stats: 2025-12-04T09:15:00.4751172Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.4751436Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4751655Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.4751972Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.4753143Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4754302Z triton_flex_attention_271 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4755420Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4756556Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.4757718Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.4758837Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4759086Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T09:15:00.4759160Z Autotune Choices Stats: 2025-12-04T09:15:00.4760619Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:15:00.4761061Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4761451Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.4762191Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.4763362Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4764531Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4765689Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.4766945Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.4768153Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4769309Z triton_flex_attention_backward_278 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.4770511Z triton_flex_attention_backward_279 0.0136 ms 90.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.4771713Z triton_flex_attention_backward_276 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4772919Z triton_flex_attention_backward_281 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4774083Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.4774374Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.2940 seconds precompiling for 13 choices 2025-12-04T09:15:00.4774510Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.4774588Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.4774653Z unimplemented [] 2025-12-04T09:15:00.4774763Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.4774951Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.4776130Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.4776199Z graph_break [] 2025-12-04T09:15:00.4776328Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.4776437Z Autotune Choices Stats: 2025-12-04T09:15:00.4777833Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.4778083Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4778340Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.4778709Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.4779859Z triton_flex_attention_289 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4780990Z triton_flex_attention_290 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4782115Z triton_flex_attention_287 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.4783239Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4784389Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.4785548Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4785797Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3003 seconds precompiling for 6 choices 2025-12-04T09:15:00.4785872Z Autotune Choices Stats: 2025-12-04T09:15:00.4787343Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_294", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.012160000391304493, "best_triton_pos": 0} 2025-12-04T09:15:00.4787813Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4788139Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.4788712Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.4789887Z triton_flex_attention_backward_294 0.0122 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.4791046Z triton_flex_attention_backward_291 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4792233Z triton_flex_attention_backward_292 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4793392Z triton_flex_attention_backward_293 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.4794588Z triton_flex_attention_backward_296 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4795735Z triton_flex_attention_backward_295 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4796976Z triton_flex_attention_backward_297 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.4798128Z triton_flex_attention_backward_298 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.4799282Z triton_flex_attention_backward_300 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4800438Z triton_flex_attention_backward_303 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.4800719Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.2523 seconds precompiling for 13 choices 2025-12-04T09:15:00.4800849Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.4800926Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.4800991Z unimplemented [] 2025-12-04T09:15:00.4801095Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.4801293Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.4802506Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.4802584Z graph_break [] 2025-12-04T09:15:00.4802713Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.4802786Z Autotune Choices Stats: 2025-12-04T09:15:00.4804218Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_308", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.4804557Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4804777Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.4805094Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.4806237Z triton_flex_attention_308 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4807350Z triton_flex_attention_309 0.0082 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4808478Z triton_flex_attention_307 0.0112 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.4809636Z triton_flex_attention_304 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4810786Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.4811909Z triton_flex_attention_305 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4812154Z SingleProcess AUTOTUNE benchmarking takes 0.2924 seconds and 1.4096 seconds precompiling for 6 choices 2025-12-04T09:15:00.4812227Z Autotune Choices Stats: 2025-12-04T09:15:00.4813754Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.4814227Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4814548Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.4815117Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.4816279Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4817441Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4818632Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.4819825Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.4820982Z triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4822203Z triton_flex_attention_backward_316 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.4823357Z triton_flex_attention_backward_317 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.4824511Z triton_flex_attention_backward_314 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4825665Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4826863Z triton_flex_attention_backward_318 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4827120Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.3225 seconds precompiling for 13 choices 2025-12-04T09:15:00.4827251Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.4827328Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.4827392Z unimplemented [] 2025-12-04T09:15:00.4827497Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.4827727Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.4828910Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.4828977Z graph_break [] 2025-12-04T09:15:00.4829104Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.4829175Z Autotune Choices Stats: 2025-12-04T09:15:00.4830627Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.007968000136315823, "best_triton_pos": 0} 2025-12-04T09:15:00.4830878Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4831095Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.4831410Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.4832548Z triton_flex_attention_327 0.0080 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4833657Z triton_flex_attention_328 0.0082 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4834818Z triton_flex_attention_325 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.4835939Z triton_flex_attention_326 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.4837087Z triton_flex_attention_323 0.0113 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4838210Z triton_flex_attention_324 0.0133 ms 59.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4838518Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T09:15:00.4838593Z Autotune Choices Stats: 2025-12-04T09:15:00.4840011Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.4840453Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4840777Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.4841350Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.4842519Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4843717Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4844922Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.4846088Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.4847291Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4848483Z triton_flex_attention_backward_336 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.4849640Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4850791Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.4851982Z triton_flex_attention_backward_338 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4853187Z triton_flex_attention_backward_341 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.4853481Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3668 seconds precompiling for 13 choices 2025-12-04T09:15:00.4853617Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.4853694Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.4853760Z unimplemented [] 2025-12-04T09:15:00.4853866Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.4854057Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.4855229Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.4855335Z graph_break [] 2025-12-04T09:15:00.4855495Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.4855578Z Autotune Choices Stats: 2025-12-04T09:15:00.4856959Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.4857212Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4857432Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.4857746Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.4858885Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4860033Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4861154Z triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4862513Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.4863665Z triton_flex_attention_345 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.4864899Z triton_flex_attention_343 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4865152Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2820 seconds precompiling for 6 choices 2025-12-04T09:15:00.4865229Z Autotune Choices Stats: 2025-12-04T09:15:00.4866663Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_349", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.4867103Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4867429Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.4868003Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.4869218Z triton_flex_attention_backward_349 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4870425Z triton_flex_attention_backward_348 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4871577Z triton_flex_attention_backward_350 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.4872763Z triton_flex_attention_backward_351 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.4873942Z triton_flex_attention_backward_353 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4875093Z triton_flex_attention_backward_355 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.4876246Z triton_flex_attention_backward_352 0.0143 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4877401Z triton_flex_attention_backward_354 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.4878608Z triton_flex_attention_backward_357 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4879796Z triton_flex_attention_backward_360 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.4880055Z SingleProcess AUTOTUNE benchmarking takes 0.6637 seconds and 2.2959 seconds precompiling for 13 choices 2025-12-04T09:15:00.4880186Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.4880273Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.4880339Z unimplemented [] 2025-12-04T09:15:00.4880442Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.4880635Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.4881874Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.4881942Z graph_break [] 2025-12-04T09:15:00.4882072Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.4882138Z Autotune Choices Stats: 2025-12-04T09:15:00.4883533Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.4883779Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4884005Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.4884323Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.4885460Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4886617Z triton_flex_attention_366 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4887776Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.4888898Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4890046Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.4891211Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4891456Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3270 seconds precompiling for 6 choices 2025-12-04T09:15:00.4891531Z Autotune Choices Stats: 2025-12-04T09:15:00.4893066Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_368", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.4893505Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4893875Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.4894438Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.4895605Z triton_flex_attention_backward_368 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4896811Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4897959Z triton_flex_attention_backward_369 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.4899177Z triton_flex_attention_backward_370 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.4900329Z triton_flex_attention_backward_372 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4901491Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.4902658Z triton_flex_attention_backward_374 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.4903842Z triton_flex_attention_backward_371 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4905036Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4906188Z triton_flex_attention_backward_379 0.0146 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.4906447Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.4637 seconds precompiling for 13 choices 2025-12-04T09:15:00.4906633Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.4906707Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.4906783Z unimplemented [] 2025-12-04T09:15:00.4906941Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.4907137Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.4908321Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.4908390Z graph_break [] 2025-12-04T09:15:00.4908535Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.4908614Z Autotune Choices Stats: 2025-12-04T09:15:00.4910026Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_384", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.4910268Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4910494Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.4910850Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.4912003Z triton_flex_attention_384 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4913157Z triton_flex_attention_385 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4914279Z triton_flex_attention_380 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4915428Z triton_flex_attention_382 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.4916586Z triton_flex_attention_383 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.4917706Z triton_flex_attention_381 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4917965Z SingleProcess AUTOTUNE benchmarking takes 0.2918 seconds and 1.3242 seconds precompiling for 6 choices 2025-12-04T09:15:00.4918037Z Autotune Choices Stats: 2025-12-04T09:15:00.4919478Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_386", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:15:00.4919947Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4920281Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.4920845Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.4922051Z triton_flex_attention_backward_386 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4923230Z triton_flex_attention_backward_387 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4924428Z triton_flex_attention_backward_388 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.4925621Z triton_flex_attention_backward_389 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.4926783Z triton_flex_attention_backward_391 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4927942Z triton_flex_attention_backward_392 0.0134 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.4929146Z triton_flex_attention_backward_390 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4930306Z triton_flex_attention_backward_393 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.4931504Z triton_flex_attention_backward_395 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4932729Z triton_flex_attention_backward_394 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4933025Z SingleProcess AUTOTUNE benchmarking takes 0.6645 seconds and 2.2524 seconds precompiling for 13 choices 2025-12-04T09:15:00.4933159Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.4933229Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.4933301Z unimplemented [] 2025-12-04T09:15:00.4933407Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.4933592Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.4934790Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.4934854Z graph_break [] 2025-12-04T09:15:00.4934989Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.4935055Z Autotune Choices Stats: 2025-12-04T09:15:00.4936445Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_403", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.4936727Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4936949Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.4937267Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.4938415Z triton_flex_attention_403 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4939589Z triton_flex_attention_404 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4940715Z triton_flex_attention_399 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4941974Z triton_flex_attention_401 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.4943099Z triton_flex_attention_402 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.4944223Z triton_flex_attention_400 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4944472Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3374 seconds precompiling for 6 choices 2025-12-04T09:15:00.4944538Z Autotune Choices Stats: 2025-12-04T09:15:00.4945977Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_406", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.4946444Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4946777Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.4947376Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.4948549Z triton_flex_attention_backward_406 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4949732Z triton_flex_attention_backward_405 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4950921Z triton_flex_attention_backward_407 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.4952082Z triton_flex_attention_backward_408 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.4953234Z triton_flex_attention_backward_410 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4954387Z triton_flex_attention_backward_409 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4955577Z triton_flex_attention_backward_411 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.4956769Z triton_flex_attention_backward_412 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.4957933Z triton_flex_attention_backward_413 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4959143Z triton_flex_attention_backward_414 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4959394Z SingleProcess AUTOTUNE benchmarking takes 0.6644 seconds and 2.4031 seconds precompiling for 13 choices 2025-12-04T09:15:00.4959523Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.4959595Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.4959664Z unimplemented [] 2025-12-04T09:15:00.4959768Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.4959954Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.4961140Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.4961202Z graph_break [] 2025-12-04T09:15:00.4961335Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.4961402Z Autotune Choices Stats: 2025-12-04T09:15:00.4962964Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_422", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.4963284Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4963508Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.4963824Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.4965004Z triton_flex_attention_422 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4966123Z triton_flex_attention_423 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4967670Z triton_flex_attention_420 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.4968785Z triton_flex_attention_421 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.4969916Z triton_flex_attention_418 0.0123 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4971030Z triton_flex_attention_419 0.0142 ms 57.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4971316Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3073 seconds precompiling for 6 choices 2025-12-04T09:15:00.4971387Z Autotune Choices Stats: 2025-12-04T09:15:00.4972879Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_426", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012256000190973282, "best_triton_pos": 0} 2025-12-04T09:15:00.4973308Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4973691Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.4974251Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.4975415Z triton_flex_attention_backward_426 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.4976632Z triton_flex_attention_backward_424 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4977793Z triton_flex_attention_backward_425 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4978949Z triton_flex_attention_backward_427 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.4980107Z triton_flex_attention_backward_429 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4981293Z triton_flex_attention_backward_431 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.4982471Z triton_flex_attention_backward_430 0.0141 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.4983628Z triton_flex_attention_backward_428 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.4984810Z triton_flex_attention_backward_433 0.0143 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4986000Z triton_flex_attention_backward_436 0.0144 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.4986259Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.3392 seconds precompiling for 13 choices 2025-12-04T09:15:00.4986393Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.4986462Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.4986532Z unimplemented [] 2025-12-04T09:15:00.4986638Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.4986824Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.4988002Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.4988113Z graph_break [] 2025-12-04T09:15:00.4988248Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.4988317Z Autotune Choices Stats: 2025-12-04T09:15:00.4989712Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_441", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.4989951Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4990176Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.4990530Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.4991659Z triton_flex_attention_441 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4992819Z triton_flex_attention_442 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4993976Z triton_flex_attention_439 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.4995089Z triton_flex_attention_437 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4996210Z triton_flex_attention_440 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.4997323Z triton_flex_attention_438 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.4997618Z SingleProcess AUTOTUNE benchmarking takes 0.2906 seconds and 1.4377 seconds precompiling for 6 choices 2025-12-04T09:15:00.4997685Z Autotune Choices Stats: 2025-12-04T09:15:00.4999143Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_445", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.4999578Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.4999907Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5000474Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5001713Z triton_flex_attention_backward_445 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5002863Z triton_flex_attention_backward_446 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5004020Z triton_flex_attention_backward_443 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5005178Z triton_flex_attention_backward_444 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5006402Z triton_flex_attention_backward_447 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5007599Z triton_flex_attention_backward_448 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5008754Z triton_flex_attention_backward_450 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5009947Z triton_flex_attention_backward_449 0.0135 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5011127Z triton_flex_attention_backward_452 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5012290Z triton_flex_attention_backward_455 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.5012606Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.3278 seconds precompiling for 13 choices 2025-12-04T09:15:00.5012737Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.5012808Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.5012877Z unimplemented [] 2025-12-04T09:15:00.5012996Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.5013185Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.5014366Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.5014468Z graph_break [] 2025-12-04T09:15:00.5014604Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.5014670Z Autotune Choices Stats: 2025-12-04T09:15:00.5016098Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_460", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:15:00.5016344Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5016566Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5016893Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5018057Z triton_flex_attention_460 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5019209Z triton_flex_attention_461 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5020331Z triton_flex_attention_459 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.5021446Z triton_flex_attention_456 0.0114 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5022570Z triton_flex_attention_458 0.0123 ms 75.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.5023717Z triton_flex_attention_457 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5023972Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.4513 seconds precompiling for 6 choices 2025-12-04T09:15:00.5024041Z Autotune Choices Stats: 2025-12-04T09:15:00.5025503Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_463", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.5025941Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5026268Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5026903Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5028076Z triton_flex_attention_backward_463 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5029235Z triton_flex_attention_backward_465 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5030394Z triton_flex_attention_backward_462 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5031579Z triton_flex_attention_backward_464 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5032738Z triton_flex_attention_backward_467 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5033930Z triton_flex_attention_backward_468 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5035081Z triton_flex_attention_backward_469 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5036322Z triton_flex_attention_backward_466 0.0133 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5037479Z triton_flex_attention_backward_471 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5038640Z triton_flex_attention_backward_470 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5038895Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.2662 seconds precompiling for 13 choices 2025-12-04T09:15:00.5039026Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.5039130Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.5039204Z unimplemented [] 2025-12-04T09:15:00.5039307Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.5039501Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.5040681Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.5040742Z graph_break [] 2025-12-04T09:15:00.5040891Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.5040962Z Autotune Choices Stats: 2025-12-04T09:15:00.5042393Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_479", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.007968000136315823, "best_triton_pos": 0} 2025-12-04T09:15:00.5042640Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5042856Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5043183Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5044395Z triton_flex_attention_479 0.0080 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5045513Z triton_flex_attention_480 0.0082 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5046637Z triton_flex_attention_477 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.5047757Z triton_flex_attention_478 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.5048910Z triton_flex_attention_475 0.0113 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5050020Z triton_flex_attention_476 0.0133 ms 59.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5050308Z SingleProcess AUTOTUNE benchmarking takes 0.2917 seconds and 1.3266 seconds precompiling for 6 choices 2025-12-04T09:15:00.5050386Z Autotune Choices Stats: 2025-12-04T09:15:00.5051823Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_481", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.5052369Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5052706Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5053266Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5054448Z triton_flex_attention_backward_481 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5055607Z triton_flex_attention_backward_482 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5056773Z triton_flex_attention_backward_483 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5057962Z triton_flex_attention_backward_484 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5059150Z triton_flex_attention_backward_485 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5060321Z triton_flex_attention_backward_486 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5061536Z triton_flex_attention_backward_487 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5062878Z triton_flex_attention_backward_488 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5064034Z triton_flex_attention_backward_490 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5065188Z triton_flex_attention_backward_493 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.5065501Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2475 seconds precompiling for 13 choices 2025-12-04T09:15:00.5065637Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.5065705Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.5065771Z unimplemented [] 2025-12-04T09:15:00.5065870Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.5066053Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.5067229Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.5067341Z graph_break [] 2025-12-04T09:15:00.5067482Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.5067551Z Autotune Choices Stats: 2025-12-04T09:15:00.5068950Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_498", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.5069268Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5069541Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5069868Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5071002Z triton_flex_attention_498 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5072136Z triton_flex_attention_499 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5073251Z triton_flex_attention_494 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5074403Z triton_flex_attention_496 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.5075529Z triton_flex_attention_497 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.5076728Z triton_flex_attention_495 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5076978Z SingleProcess AUTOTUNE benchmarking takes 0.2916 seconds and 1.3152 seconds precompiling for 6 choices 2025-12-04T09:15:00.5077048Z Autotune Choices Stats: 2025-12-04T09:15:00.5078516Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_500", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:15:00.5078980Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5079308Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5079870Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5081050Z triton_flex_attention_backward_500 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5082210Z triton_flex_attention_backward_501 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5083404Z triton_flex_attention_backward_502 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5084588Z triton_flex_attention_backward_503 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5085751Z triton_flex_attention_backward_505 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5086938Z triton_flex_attention_backward_507 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5088115Z triton_flex_attention_backward_506 0.0143 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5089274Z triton_flex_attention_backward_504 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5090421Z triton_flex_attention_backward_509 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5091606Z triton_flex_attention_backward_512 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.5091852Z SingleProcess AUTOTUNE benchmarking takes 0.6641 seconds and 2.2733 seconds precompiling for 13 choices 2025-12-04T09:15:00.5091986Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.5092055Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.5092123Z unimplemented [] 2025-12-04T09:15:00.5092229Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.5092461Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.5093699Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 27), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 8), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 5), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.5093762Z graph_break [] 2025-12-04T09:15:00.5093895Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.5093962Z Autotune Choices Stats: 2025-12-04T09:15:00.5095383Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_517", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.5095669Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5095885Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5096210Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5097338Z triton_flex_attention_517 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5098455Z triton_flex_attention_518 0.0084 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5099587Z triton_flex_attention_513 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5100737Z triton_flex_attention_515 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.5101906Z triton_flex_attention_516 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.5103023Z triton_flex_attention_514 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5103273Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3311 seconds precompiling for 6 choices 2025-12-04T09:15:00.5103372Z Autotune Choices Stats: 2025-12-04T09:15:00.5104832Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_519", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:15:00.5105267Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5105604Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5106166Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5107338Z triton_flex_attention_backward_519 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5108525Z triton_flex_attention_backward_520 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5109690Z triton_flex_attention_backward_521 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5110888Z triton_flex_attention_backward_522 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5112076Z triton_flex_attention_backward_524 0.0132 ms 93.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5113266Z triton_flex_attention_backward_523 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5114424Z triton_flex_attention_backward_525 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5115585Z triton_flex_attention_backward_526 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5116740Z triton_flex_attention_backward_528 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5117926Z triton_flex_attention_backward_531 0.0153 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.5118168Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.2909 seconds precompiling for 13 choices 2025-12-04T09:15:00.5118300Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.5118367Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.5118479Z unimplemented [] 2025-12-04T09:15:00.5118584Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.5118766Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.5119954Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.5120014Z graph_break [] 2025-12-04T09:15:00.5120183Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.5120247Z Autotune Choices Stats: 2025-12-04T09:15:00.5121658Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_536", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.5121904Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5122121Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5122444Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5123571Z triton_flex_attention_536 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5124704Z triton_flex_attention_537 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5125844Z triton_flex_attention_532 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5127000Z triton_flex_attention_534 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.5128120Z triton_flex_attention_535 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.5129267Z triton_flex_attention_533 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5129553Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.2928 seconds precompiling for 6 choices 2025-12-04T09:15:00.5129618Z Autotune Choices Stats: 2025-12-04T09:15:00.5131041Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_538", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:15:00.5131473Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5131802Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5132402Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5133591Z triton_flex_attention_backward_538 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5134801Z triton_flex_attention_backward_539 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5135991Z triton_flex_attention_backward_540 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5137139Z triton_flex_attention_backward_541 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5138381Z triton_flex_attention_backward_542 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5139536Z triton_flex_attention_backward_543 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5140689Z triton_flex_attention_backward_545 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5141841Z triton_flex_attention_backward_544 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5143148Z triton_flex_attention_backward_547 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5144340Z triton_flex_attention_backward_550 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.5144588Z SingleProcess AUTOTUNE benchmarking takes 0.6681 seconds and 2.3456 seconds precompiling for 13 choices 2025-12-04T09:15:00.5144719Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.5144787Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.5144856Z unimplemented [] 2025-12-04T09:15:00.5144958Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.5145141Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.5146356Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.5146449Z graph_break [] 2025-12-04T09:15:00.5146583Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.5146649Z Autotune Choices Stats: 2025-12-04T09:15:00.5148048Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_555", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.5148299Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5148514Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5148835Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5149961Z triton_flex_attention_555 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5151127Z triton_flex_attention_556 0.0112 ms 91.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5152237Z triton_flex_attention_553 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.5153392Z triton_flex_attention_554 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.5154510Z triton_flex_attention_551 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5155684Z triton_flex_attention_552 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5155930Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.3230 seconds precompiling for 6 choices 2025-12-04T09:15:00.5155995Z Autotune Choices Stats: 2025-12-04T09:15:00.5157424Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_559", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.5157854Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5158182Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5158780Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5159951Z triton_flex_attention_backward_559 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5161138Z triton_flex_attention_backward_560 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5162444Z triton_flex_attention_backward_558 0.0142 ms 93.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5163668Z triton_flex_attention_backward_557 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5164900Z triton_flex_attention_backward_562 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5166065Z triton_flex_attention_backward_561 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5167217Z triton_flex_attention_backward_563 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5168426Z triton_flex_attention_backward_564 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5169574Z triton_flex_attention_backward_569 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.5170779Z triton_flex_attention_backward_566 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5171024Z SingleProcess AUTOTUNE benchmarking takes 0.6659 seconds and 2.2862 seconds precompiling for 13 choices 2025-12-04T09:15:00.5171157Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.5171226Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.5171288Z unimplemented [] 2025-12-04T09:15:00.5171428Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.5171610Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.5172906Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.5172971Z graph_break [] 2025-12-04T09:15:00.5173102Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.5173169Z Autotune Choices Stats: 2025-12-04T09:15:00.5174553Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_575", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.5174805Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5175021Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5175340Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5176501Z triton_flex_attention_575 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5177620Z triton_flex_attention_574 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5178777Z triton_flex_attention_572 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.5179896Z triton_flex_attention_573 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.5181076Z triton_flex_attention_570 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5182188Z triton_flex_attention_571 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5182437Z SingleProcess AUTOTUNE benchmarking takes 0.2930 seconds and 1.2848 seconds precompiling for 6 choices 2025-12-04T09:15:00.5182503Z Autotune Choices Stats: 2025-12-04T09:15:00.5183936Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_579", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.014271999709308147, "best_triton_pos": 0} 2025-12-04T09:15:00.5184364Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5184731Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5185302Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5186505Z triton_flex_attention_backward_579 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5187665Z triton_flex_attention_backward_577 0.0143 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5188850Z triton_flex_attention_backward_578 0.0143 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5190030Z triton_flex_attention_backward_576 0.0145 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5191180Z triton_flex_attention_backward_580 0.0154 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5192328Z triton_flex_attention_backward_581 0.0154 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5193481Z triton_flex_attention_backward_583 0.0155 ms 92.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5194665Z triton_flex_attention_backward_582 0.0164 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5195846Z triton_flex_attention_backward_585 0.0164 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5196995Z triton_flex_attention_backward_588 0.0164 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.5197288Z SingleProcess AUTOTUNE benchmarking takes 0.6683 seconds and 2.2637 seconds precompiling for 13 choices 2025-12-04T09:15:00.5197454Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.5197523Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.5197586Z unimplemented [] 2025-12-04T09:15:00.5197701Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.5197885Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.5199060Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.5199122Z graph_break [] 2025-12-04T09:15:00.5199258Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.5199325Z Autotune Choices Stats: 2025-12-04T09:15:00.5200710Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.5200953Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5201209Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5201527Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5202652Z triton_flex_attention_593 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5203813Z triton_flex_attention_594 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5204933Z triton_flex_attention_589 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5206084Z triton_flex_attention_591 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.5207234Z triton_flex_attention_592 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.5208350Z triton_flex_attention_590 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5208597Z SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.3283 seconds precompiling for 6 choices 2025-12-04T09:15:00.5208667Z Autotune Choices Stats: 2025-12-04T09:15:00.5210091Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_595", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:15:00.5210620Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5210943Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5211508Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5212750Z triton_flex_attention_backward_595 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5213913Z triton_flex_attention_backward_596 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5215143Z triton_flex_attention_backward_597 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5216297Z triton_flex_attention_backward_598 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5217454Z triton_flex_attention_backward_600 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5218615Z triton_flex_attention_backward_604 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5219801Z triton_flex_attention_backward_599 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5220989Z triton_flex_attention_backward_601 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5222139Z triton_flex_attention_backward_602 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5223328Z triton_flex_attention_backward_607 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.5223603Z SingleProcess AUTOTUNE benchmarking takes 0.6683 seconds and 2.3815 seconds precompiling for 13 choices 2025-12-04T09:15:00.5223736Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.5223806Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.5223866Z unimplemented [] 2025-12-04T09:15:00.5223982Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.5224165Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.5225355Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.5225419Z graph_break [] 2025-12-04T09:15:00.5225547Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.5225621Z Autotune Choices Stats: 2025-12-04T09:15:00.5227014Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_612", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.5227303Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5227518Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5227842Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5229023Z triton_flex_attention_612 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5230141Z triton_flex_attention_613 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5231289Z triton_flex_attention_610 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.5232435Z triton_flex_attention_608 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5233556Z triton_flex_attention_611 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.5234675Z triton_flex_attention_609 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5234925Z SingleProcess AUTOTUNE benchmarking takes 0.2948 seconds and 1.3336 seconds precompiling for 6 choices 2025-12-04T09:15:00.5235028Z Autotune Choices Stats: 2025-12-04T09:15:00.5236450Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_614", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:15:00.5236891Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5237212Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5237822Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5238987Z triton_flex_attention_backward_614 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5240225Z triton_flex_attention_backward_615 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5241385Z triton_flex_attention_backward_616 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5242766Z triton_flex_attention_backward_617 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5243943Z triton_flex_attention_backward_619 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5245163Z triton_flex_attention_backward_620 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5246319Z triton_flex_attention_backward_621 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5247508Z triton_flex_attention_backward_618 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5248689Z triton_flex_attention_backward_623 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5249871Z triton_flex_attention_backward_626 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.5250123Z SingleProcess AUTOTUNE benchmarking takes 0.6671 seconds and 2.3094 seconds precompiling for 13 choices 2025-12-04T09:15:00.5250262Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.5250333Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.5250395Z unimplemented [] 2025-12-04T09:15:00.5250503Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.5250687Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.5251862Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.5251921Z graph_break [] 2025-12-04T09:15:00.5252049Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.5252162Z Autotune Choices Stats: 2025-12-04T09:15:00.5253610Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_631", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.5253860Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5254081Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5254409Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5255572Z triton_flex_attention_631 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5256690Z triton_flex_attention_632 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5257866Z triton_flex_attention_627 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5258987Z triton_flex_attention_629 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.5260101Z triton_flex_attention_630 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.5261217Z triton_flex_attention_628 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5261520Z SingleProcess AUTOTUNE benchmarking takes 0.2939 seconds and 1.3035 seconds precompiling for 6 choices 2025-12-04T09:15:00.5261587Z Autotune Choices Stats: 2025-12-04T09:15:00.5263166Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_634", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.5263692Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5264024Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5264592Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5265818Z triton_flex_attention_backward_634 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5267028Z triton_flex_attention_backward_635 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5268189Z triton_flex_attention_backward_636 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5269340Z triton_flex_attention_backward_633 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5270549Z triton_flex_attention_backward_638 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5271703Z triton_flex_attention_backward_640 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5272890Z triton_flex_attention_backward_637 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5274039Z triton_flex_attention_backward_639 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5275253Z triton_flex_attention_backward_642 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5276406Z triton_flex_attention_backward_645 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.5276657Z SingleProcess AUTOTUNE benchmarking takes 0.6697 seconds and 2.2461 seconds precompiling for 13 choices 2025-12-04T09:15:00.5276799Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.5276872Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.5276939Z unimplemented [] 2025-12-04T09:15:00.5277048Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.5277231Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.5278410Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.5278507Z graph_break [] 2025-12-04T09:15:00.5278635Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.5278713Z Autotune Choices Stats: 2025-12-04T09:15:00.5280097Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_650", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.5280348Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5280599Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5280926Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5282049Z triton_flex_attention_650 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5283233Z triton_flex_attention_651 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5284344Z triton_flex_attention_648 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.5285464Z triton_flex_attention_646 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5286600Z triton_flex_attention_649 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.5287756Z triton_flex_attention_647 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5288010Z SingleProcess AUTOTUNE benchmarking takes 0.2938 seconds and 1.3188 seconds precompiling for 6 choices 2025-12-04T09:15:00.5288079Z Autotune Choices Stats: 2025-12-04T09:15:00.5289535Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_654", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.5289977Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5290304Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5290880Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5292138Z triton_flex_attention_backward_654 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5293383Z triton_flex_attention_backward_652 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5294550Z triton_flex_attention_backward_653 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5295703Z triton_flex_attention_backward_655 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5296897Z triton_flex_attention_backward_657 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5298083Z triton_flex_attention_backward_659 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5299242Z triton_flex_attention_backward_658 0.0156 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5300432Z triton_flex_attention_backward_656 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5301608Z triton_flex_attention_backward_661 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5302767Z triton_flex_attention_backward_664 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.5303010Z SingleProcess AUTOTUNE benchmarking takes 0.6665 seconds and 2.3325 seconds precompiling for 13 choices 2025-12-04T09:15:00.5303146Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.5303213Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.5303273Z unimplemented [] 2025-12-04T09:15:00.5303382Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.5303600Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.5304780Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.5304842Z graph_break [] 2025-12-04T09:15:00.5304968Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.5305037Z Autotune Choices Stats: 2025-12-04T09:15:00.5306460Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_669", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.5306714Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5306930Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5307253Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5308411Z triton_flex_attention_669 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5309565Z triton_flex_attention_670 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5310681Z triton_flex_attention_667 0.0123 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.5311813Z triton_flex_attention_665 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5312965Z triton_flex_attention_668 0.0135 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.5314083Z triton_flex_attention_666 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5314333Z SingleProcess AUTOTUNE benchmarking takes 0.2949 seconds and 1.4095 seconds precompiling for 6 choices 2025-12-04T09:15:00.5314399Z Autotune Choices Stats: 2025-12-04T09:15:00.5315849Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_673", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012512000277638435, "best_triton_pos": 0} 2025-12-04T09:15:00.5316286Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5316674Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5317237Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5318401Z triton_flex_attention_backward_673 0.0125 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5319562Z triton_flex_attention_backward_672 0.0133 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5320712Z triton_flex_attention_backward_674 0.0133 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5321896Z triton_flex_attention_backward_671 0.0143 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5323079Z triton_flex_attention_backward_676 0.0143 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5324228Z triton_flex_attention_backward_675 0.0152 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5325428Z triton_flex_attention_backward_677 0.0154 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5326615Z triton_flex_attention_backward_678 0.0154 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5327763Z triton_flex_attention_backward_680 0.0164 ms 76.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5328915Z triton_flex_attention_backward_683 0.0164 ms 76.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.5329159Z SingleProcess AUTOTUNE benchmarking takes 0.6665 seconds and 2.3098 seconds precompiling for 13 choices 2025-12-04T09:15:00.5329328Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.5329400Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.5329462Z unimplemented [] 2025-12-04T09:15:00.5329572Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.5329755Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.5330930Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.5330992Z graph_break [] 2025-12-04T09:15:00.5331117Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.5331202Z Autotune Choices Stats: 2025-12-04T09:15:00.5332679Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_688", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.5332937Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5333154Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5333548Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5334674Z triton_flex_attention_688 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5335796Z triton_flex_attention_689 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5336920Z triton_flex_attention_686 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.5338044Z triton_flex_attention_684 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5339185Z triton_flex_attention_687 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.5340340Z triton_flex_attention_685 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5340584Z SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.3382 seconds precompiling for 6 choices 2025-12-04T09:15:00.5340659Z Autotune Choices Stats: 2025-12-04T09:15:00.5342078Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_693", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.5342628Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5342951Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5343516Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5344681Z triton_flex_attention_backward_693 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5345836Z triton_flex_attention_backward_692 0.0135 ms 98.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5347026Z triton_flex_attention_backward_691 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5348177Z triton_flex_attention_backward_695 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5349367Z triton_flex_attention_backward_690 0.0153 ms 86.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5350520Z triton_flex_attention_backward_696 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5351738Z triton_flex_attention_backward_697 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5352898Z triton_flex_attention_backward_699 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5354046Z triton_flex_attention_backward_694 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5355198Z triton_flex_attention_backward_702 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.5355476Z SingleProcess AUTOTUNE benchmarking takes 0.6670 seconds and 2.3646 seconds precompiling for 13 choices 2025-12-04T09:15:00.5355619Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.5355688Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.5355753Z unimplemented [] 2025-12-04T09:15:00.5355867Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.5356054Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.5357284Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.5357351Z graph_break [] 2025-12-04T09:15:00.5357481Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.5357553Z Autotune Choices Stats: 2025-12-04T09:15:00.5358933Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_707", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.5359245Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5359463Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5359782Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5360904Z triton_flex_attention_707 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5362225Z triton_flex_attention_708 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5363360Z triton_flex_attention_703 0.0134 ms 76.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5364550Z triton_flex_attention_705 0.0135 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.5365711Z triton_flex_attention_706 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.5366841Z triton_flex_attention_704 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5367085Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3640 seconds precompiling for 6 choices 2025-12-04T09:15:00.5367156Z Autotune Choices Stats: 2025-12-04T09:15:00.5368627Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_712", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.5369124Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5369452Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5370027Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5371202Z triton_flex_attention_backward_712 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5372403Z triton_flex_attention_backward_710 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5373612Z triton_flex_attention_backward_711 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5374795Z triton_flex_attention_backward_709 0.0153 ms 86.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5375959Z triton_flex_attention_backward_714 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5377170Z triton_flex_attention_backward_716 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5378323Z triton_flex_attention_backward_713 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5379472Z triton_flex_attention_backward_715 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5380621Z triton_flex_attention_backward_718 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5381815Z triton_flex_attention_backward_721 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.5382061Z SingleProcess AUTOTUNE benchmarking takes 0.6684 seconds and 2.3297 seconds precompiling for 13 choices 2025-12-04T09:15:00.5382198Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.5382277Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.5382345Z unimplemented [] 2025-12-04T09:15:00.5382455Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.5382673Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.5383846Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.5383914Z graph_break [] 2025-12-04T09:15:00.5384042Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.5384115Z Autotune Choices Stats: 2025-12-04T09:15:00.5385533Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_726", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.5385824Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5386043Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5386366Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5387496Z triton_flex_attention_726 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5388611Z triton_flex_attention_727 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5389786Z triton_flex_attention_724 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.5390902Z triton_flex_attention_722 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5392052Z triton_flex_attention_725 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.5393171Z triton_flex_attention_723 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5393480Z SingleProcess AUTOTUNE benchmarking takes 0.2946 seconds and 1.3261 seconds precompiling for 6 choices 2025-12-04T09:15:00.5393558Z Autotune Choices Stats: 2025-12-04T09:15:00.5394974Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_730", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.5395413Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5395741Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5396307Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5397474Z triton_flex_attention_backward_730 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5398679Z triton_flex_attention_backward_729 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5399863Z triton_flex_attention_backward_731 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5401022Z triton_flex_attention_backward_728 0.0152 ms 87.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5402212Z triton_flex_attention_backward_733 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5403388Z triton_flex_attention_backward_732 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5404550Z triton_flex_attention_backward_734 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5405701Z triton_flex_attention_backward_735 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5406889Z triton_flex_attention_backward_737 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5408039Z triton_flex_attention_backward_736 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5408290Z SingleProcess AUTOTUNE benchmarking takes 0.6686 seconds and 2.3200 seconds precompiling for 13 choices 2025-12-04T09:15:00.5408460Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.5408531Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.5408594Z unimplemented [] 2025-12-04T09:15:00.5408703Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.5408901Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.5410077Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.5410498Z graph_break [] 2025-12-04T09:15:00.5410667Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.5410743Z Autotune Choices Stats: 2025-12-04T09:15:00.5412131Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_745", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.5412423Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5412660Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5412978Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5414108Z triton_flex_attention_745 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5415268Z triton_flex_attention_746 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5416388Z triton_flex_attention_743 0.0142 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.5417544Z triton_flex_attention_741 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5418653Z triton_flex_attention_744 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.5419865Z triton_flex_attention_742 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5420114Z SingleProcess AUTOTUNE benchmarking takes 0.2945 seconds and 1.2852 seconds precompiling for 6 choices 2025-12-04T09:15:00.5420187Z Autotune Choices Stats: 2025-12-04T09:15:00.5421610Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_748", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.5422055Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5422382Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5422943Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5424147Z triton_flex_attention_backward_748 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5425308Z triton_flex_attention_backward_749 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5426492Z triton_flex_attention_backward_747 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5427674Z triton_flex_attention_backward_750 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5428858Z triton_flex_attention_backward_752 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5430009Z triton_flex_attention_backward_751 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5431163Z triton_flex_attention_backward_753 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5432312Z triton_flex_attention_backward_754 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5433505Z triton_flex_attention_backward_756 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5434693Z triton_flex_attention_backward_759 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.5434941Z SingleProcess AUTOTUNE benchmarking takes 0.6747 seconds and 2.3463 seconds precompiling for 13 choices 2025-12-04T09:15:00.5435069Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.5435143Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.5435206Z unimplemented [] 2025-12-04T09:15:00.5435311Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.5435493Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.5436731Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.5436800Z graph_break [] 2025-12-04T09:15:00.5436925Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.5436995Z Autotune Choices Stats: 2025-12-04T09:15:00.5438381Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_764", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.5438630Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5438846Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5439161Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5440300Z triton_flex_attention_764 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5441451Z triton_flex_attention_765 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5442599Z triton_flex_attention_762 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.5443728Z triton_flex_attention_760 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5444872Z triton_flex_attention_763 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.5446020Z triton_flex_attention_761 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5446263Z SingleProcess AUTOTUNE benchmarking takes 0.2940 seconds and 1.3826 seconds precompiling for 6 choices 2025-12-04T09:15:00.5446337Z Autotune Choices Stats: 2025-12-04T09:15:00.5447756Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_769", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013376000337302685, "best_triton_pos": 0} 2025-12-04T09:15:00.5448193Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5448517Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5449123Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5450305Z triton_flex_attention_backward_769 0.0134 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5451498Z triton_flex_attention_backward_767 0.0143 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5452690Z triton_flex_attention_backward_768 0.0143 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5453938Z triton_flex_attention_backward_766 0.0154 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5455102Z triton_flex_attention_backward_770 0.0154 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5456261Z triton_flex_attention_backward_771 0.0154 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5457418Z triton_flex_attention_backward_772 0.0154 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5458597Z triton_flex_attention_backward_773 0.0154 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5459784Z triton_flex_attention_backward_775 0.0164 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5460943Z triton_flex_attention_backward_778 0.0164 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.5461189Z SingleProcess AUTOTUNE benchmarking takes 0.6673 seconds and 2.4135 seconds precompiling for 13 choices 2025-12-04T09:15:00.5461318Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.5461427Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.5461489Z unimplemented [] 2025-12-04T09:15:00.5461631Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.5461967Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.5463162Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.5463232Z graph_break [] 2025-12-04T09:15:00.5463363Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.5463441Z Autotune Choices Stats: 2025-12-04T09:15:00.5464838Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_783", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.5465085Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5465305Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5465694Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5466834Z triton_flex_attention_783 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5473262Z triton_flex_attention_784 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5474473Z triton_flex_attention_781 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.5475667Z triton_flex_attention_779 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5476847Z triton_flex_attention_782 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.5477975Z triton_flex_attention_780 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5478234Z SingleProcess AUTOTUNE benchmarking takes 0.2940 seconds and 1.3371 seconds precompiling for 6 choices 2025-12-04T09:15:00.5478310Z Autotune Choices Stats: 2025-12-04T09:15:00.5479752Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_787", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.5480236Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5480563Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5481125Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5482332Z triton_flex_attention_backward_787 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5483507Z triton_flex_attention_backward_785 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5484737Z triton_flex_attention_backward_786 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5485922Z triton_flex_attention_backward_788 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5487072Z triton_flex_attention_backward_790 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5488208Z triton_flex_attention_backward_791 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5489402Z triton_flex_attention_backward_789 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5490534Z triton_flex_attention_backward_792 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5491750Z triton_flex_attention_backward_794 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5493037Z triton_flex_attention_backward_797 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.5493379Z SingleProcess AUTOTUNE benchmarking takes 0.6666 seconds and 2.3976 seconds precompiling for 13 choices 2025-12-04T09:15:00.5493527Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.5493610Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.5493674Z unimplemented [] 2025-12-04T09:15:00.5493781Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.5493975Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.5495158Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.5495230Z graph_break [] 2025-12-04T09:15:00.5495362Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.5495437Z Autotune Choices Stats: 2025-12-04T09:15:00.5496823Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_803", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.5497114Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5497337Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5497656Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5498781Z triton_flex_attention_803 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5499926Z triton_flex_attention_802 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5501035Z triton_flex_attention_801 0.0143 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.5502202Z triton_flex_attention_798 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5503301Z triton_flex_attention_800 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.5504409Z triton_flex_attention_799 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5504655Z SingleProcess AUTOTUNE benchmarking takes 0.2967 seconds and 1.3426 seconds precompiling for 6 choices 2025-12-04T09:15:00.5504736Z Autotune Choices Stats: 2025-12-04T09:15:00.5506165Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_806", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.5506639Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5506963Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5507565Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5508720Z triton_flex_attention_backward_806 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5509908Z triton_flex_attention_backward_807 0.0142 ms 93.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5511092Z triton_flex_attention_backward_804 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5512236Z triton_flex_attention_backward_805 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5513376Z triton_flex_attention_backward_809 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5514519Z triton_flex_attention_backward_810 0.0163 ms 81.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5515692Z triton_flex_attention_backward_813 0.0163 ms 81.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5516866Z triton_flex_attention_backward_808 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5518010Z triton_flex_attention_backward_811 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5519207Z triton_flex_attention_backward_816 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.5519462Z SingleProcess AUTOTUNE benchmarking takes 0.6691 seconds and 2.4016 seconds precompiling for 13 choices 2025-12-04T09:15:00.5519593Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.5519670Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.5519735Z unimplemented [] 2025-12-04T09:15:00.5519837Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.5520034Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.5521215Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.5521283Z graph_break [] 2025-12-04T09:15:00.5521411Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.5521478Z Autotune Choices Stats: 2025-12-04T09:15:00.5522868Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_821", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.5523176Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5523401Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5523715Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5524876Z triton_flex_attention_821 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5525979Z triton_flex_attention_822 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5527164Z triton_flex_attention_819 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.5528276Z triton_flex_attention_820 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.5529381Z triton_flex_attention_817 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5530488Z triton_flex_attention_818 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5530763Z SingleProcess AUTOTUNE benchmarking takes 0.2943 seconds and 1.3304 seconds precompiling for 6 choices 2025-12-04T09:15:00.5530838Z Autotune Choices Stats: 2025-12-04T09:15:00.5532263Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_826", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.5532745Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5533111Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5533677Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5534838Z triton_flex_attention_backward_826 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5536051Z triton_flex_attention_backward_823 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5537194Z triton_flex_attention_backward_824 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5538340Z triton_flex_attention_backward_825 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5539496Z triton_flex_attention_backward_828 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5540683Z triton_flex_attention_backward_829 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5541858Z triton_flex_attention_backward_830 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5543209Z triton_flex_attention_backward_827 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5544411Z triton_flex_attention_backward_835 0.0173 ms 76.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.5545596Z triton_flex_attention_backward_831 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5545857Z SingleProcess AUTOTUNE benchmarking takes 0.6677 seconds and 2.3188 seconds precompiling for 13 choices 2025-12-04T09:15:00.5545996Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.5546081Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.5546146Z unimplemented [] 2025-12-04T09:15:00.5546254Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.5546449Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.5547628Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.5547735Z graph_break [] 2025-12-04T09:15:00.5547865Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.5547936Z Autotune Choices Stats: 2025-12-04T09:15:00.5549343Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_840", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.5549588Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5549816Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5550171Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5551311Z triton_flex_attention_840 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5552450Z triton_flex_attention_841 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5553619Z triton_flex_attention_838 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.5554721Z triton_flex_attention_836 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5555825Z triton_flex_attention_839 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.5556929Z triton_flex_attention_837 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5557223Z SingleProcess AUTOTUNE benchmarking takes 0.2942 seconds and 1.3626 seconds precompiling for 6 choices 2025-12-04T09:15:00.5557294Z Autotune Choices Stats: 2025-12-04T09:15:00.5558748Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_844", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.5559188Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5559510Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5560076Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5561302Z triton_flex_attention_backward_844 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5562649Z triton_flex_attention_backward_845 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5563827Z triton_flex_attention_backward_842 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5564986Z triton_flex_attention_backward_843 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5566205Z triton_flex_attention_backward_846 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5567352Z triton_flex_attention_backward_847 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5568547Z triton_flex_attention_backward_848 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5569731Z triton_flex_attention_backward_851 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5570920Z triton_flex_attention_backward_849 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5572056Z triton_flex_attention_backward_854 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.5572358Z SingleProcess AUTOTUNE benchmarking takes 0.6694 seconds and 2.3460 seconds precompiling for 13 choices 2025-12-04T09:15:00.5572494Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.5572571Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.5572634Z unimplemented [] 2025-12-04T09:15:00.5572751Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.5572944Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.5574122Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.5574230Z graph_break [] 2025-12-04T09:15:00.5574359Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.5574424Z Autotune Choices Stats: 2025-12-04T09:15:00.5575851Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_859", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.5576102Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5576330Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5576649Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5577806Z triton_flex_attention_859 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5578945Z triton_flex_attention_860 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5580056Z triton_flex_attention_855 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5581162Z triton_flex_attention_857 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.5582266Z triton_flex_attention_858 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.5583415Z triton_flex_attention_856 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5583660Z SingleProcess AUTOTUNE benchmarking takes 0.2942 seconds and 1.3732 seconds precompiling for 6 choices 2025-12-04T09:15:00.5583731Z Autotune Choices Stats: 2025-12-04T09:15:00.5585188Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_862", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.5585634Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5585959Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5586602Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5587771Z triton_flex_attention_backward_862 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5588931Z triton_flex_attention_backward_864 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5590075Z triton_flex_attention_backward_861 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5591259Z triton_flex_attention_backward_863 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5592402Z triton_flex_attention_backward_866 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5593578Z triton_flex_attention_backward_865 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5594739Z triton_flex_attention_backward_867 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5595953Z triton_flex_attention_backward_868 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5597101Z triton_flex_attention_backward_870 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5598245Z triton_flex_attention_backward_873 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.5598495Z SingleProcess AUTOTUNE benchmarking takes 0.6688 seconds and 2.4237 seconds precompiling for 13 choices 2025-12-04T09:15:00.5598623Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.5598730Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.5598795Z unimplemented [] 2025-12-04T09:15:00.5598897Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.5599093Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.5600269Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.5600341Z graph_break [] 2025-12-04T09:15:00.5600471Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.5600539Z Autotune Choices Stats: 2025-12-04T09:15:00.5601972Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_878", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.5602213Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5602441Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5602754Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5603951Z triton_flex_attention_878 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5605053Z triton_flex_attention_879 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5606165Z triton_flex_attention_874 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5607268Z triton_flex_attention_876 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.5608415Z triton_flex_attention_877 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.5609530Z triton_flex_attention_875 0.0145 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5609810Z SingleProcess AUTOTUNE benchmarking takes 0.2942 seconds and 1.4549 seconds precompiling for 6 choices 2025-12-04T09:15:00.5609879Z Autotune Choices Stats: 2025-12-04T09:15:00.5611310Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_881", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.5611791Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5612149Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5612769Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5613939Z triton_flex_attention_backward_881 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5615103Z triton_flex_attention_backward_880 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5616245Z triton_flex_attention_backward_882 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5617428Z triton_flex_attention_backward_883 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5618666Z triton_flex_attention_backward_885 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5619825Z triton_flex_attention_backward_884 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5621001Z triton_flex_attention_backward_886 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5622175Z triton_flex_attention_backward_887 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5623322Z triton_flex_attention_backward_889 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5624462Z triton_flex_attention_backward_892 0.0172 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.5624751Z SingleProcess AUTOTUNE benchmarking takes 0.6678 seconds and 2.4344 seconds precompiling for 13 choices 2025-12-04T09:15:00.5624923Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:15:00.5625011Z Traceback (most recent call last): 2025-12-04T09:15:00.5625312Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:15:00.5625377Z self.assertTrue( 2025-12-04T09:15:00.5625598Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:15:00.5625680Z raise self.failureException(msg) 2025-12-04T09:15:00.5625921Z AssertionError: False is not true : Log file /tmp/tmp2i0vhuu1/flex_attention_configs.json was not created 2025-12-04T09:15:00.5625931Z 2025-12-04T09:15:00.5626067Z To execute this test, run the following from the base repo dir: 2025-12-04T09:15:00.5626325Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:15:00.5626329Z 2025-12-04T09:15:00.5626535Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:15:00.5626669Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.5626739Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.5626809Z unimplemented [] 2025-12-04T09:15:00.5626914Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.5628102Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('async_compile_cache_miss', 23), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:15:00.5628324Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.5628415Z graph_break [] 2025-12-04T09:15:00.5628552Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.5629551Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:15:00.5629652Z current_size = base.storage().size() 2025-12-04T09:15:00.5629722Z Autotune Choices Stats: 2025-12-04T09:15:00.5631121Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.5631369Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5631596Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5631915Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5633080Z triton_flex_attention_4 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5634174Z triton_flex_attention_5 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5635310Z triton_flex_attention_0 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5636453Z triton_flex_attention_2 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.5637630Z triton_flex_attention_3 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.5638739Z triton_flex_attention_1 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5638991Z SingleProcess AUTOTUNE benchmarking takes 0.2600 seconds and 1.4244 seconds precompiling for 6 choices 2025-12-04T09:15:00.5639062Z Autotune Choices Stats: 2025-12-04T09:15:00.5640515Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.5640951Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5641325Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5641888Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5643049Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5644234Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5645401Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5646579Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5647720Z triton_flex_attention_backward_11 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5648867Z triton_flex_attention_backward_12 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5650005Z triton_flex_attention_backward_13 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5651197Z triton_flex_attention_backward_10 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5652467Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5653635Z triton_flex_attention_backward_15 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5653899Z SingleProcess AUTOTUNE benchmarking takes 0.8717 seconds and 2.2315 seconds precompiling for 13 choices 2025-12-04T09:15:00.5654068Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.5654181Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.5654251Z unimplemented [] 2025-12-04T09:15:00.5654357Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.5654550Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.5655724Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.5655798Z graph_break [] 2025-12-04T09:15:00.5655927Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.5655999Z Autotune Choices Stats: 2025-12-04T09:15:00.5657395Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.5657638Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5657901Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5658222Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5659349Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5660495Z triton_flex_attention_24 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5661604Z triton_flex_attention_19 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5662924Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.5664091Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.5665199Z triton_flex_attention_20 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5665449Z SingleProcess AUTOTUNE benchmarking takes 0.2898 seconds and 1.2778 seconds precompiling for 6 choices 2025-12-04T09:15:00.5665517Z Autotune Choices Stats: 2025-12-04T09:15:00.5666947Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.5667444Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5667776Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5668341Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5669549Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5670692Z triton_flex_attention_backward_27 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5671897Z triton_flex_attention_backward_28 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5673041Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5674180Z triton_flex_attention_backward_30 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5675321Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5676493Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5677668Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5678813Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5679986Z triton_flex_attention_backward_37 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.5680300Z SingleProcess AUTOTUNE benchmarking takes 0.6605 seconds and 2.3199 seconds precompiling for 13 choices 2025-12-04T09:15:00.5680430Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.5680499Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.5680568Z unimplemented [] 2025-12-04T09:15:00.5680670Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.5680862Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.5682031Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.5682102Z graph_break [] 2025-12-04T09:15:00.5682229Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.5682295Z Autotune Choices Stats: 2025-12-04T09:15:00.5683685Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.5683964Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5684184Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5684501Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5685656Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5686762Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5687912Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.5689047Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.5690157Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5691264Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5691505Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2616 seconds precompiling for 6 choices 2025-12-04T09:15:00.5691572Z Autotune Choices Stats: 2025-12-04T09:15:00.5693116Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.5693557Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5693890Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5694493Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5695662Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5696836Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5698016Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5699156Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5700293Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5701469Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5702604Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5703781Z triton_flex_attention_backward_50 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5704925Z triton_flex_attention_backward_53 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5706130Z triton_flex_attention_backward_56 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.5706380Z SingleProcess AUTOTUNE benchmarking takes 0.6631 seconds and 2.2635 seconds precompiling for 13 choices 2025-12-04T09:15:00.5706510Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.5706580Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.5706653Z unimplemented [] 2025-12-04T09:15:00.5706752Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.5706943Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.5708122Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.5708184Z graph_break [] 2025-12-04T09:15:00.5708327Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.5708395Z Autotune Choices Stats: 2025-12-04T09:15:00.5709823Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.5710068Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5710288Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5710605Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5711755Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5712864Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5714057Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.5715156Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.5716261Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5717359Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5717644Z SingleProcess AUTOTUNE benchmarking takes 0.2907 seconds and 1.2500 seconds precompiling for 6 choices 2025-12-04T09:15:00.5717713Z Autotune Choices Stats: 2025-12-04T09:15:00.5719136Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.5719569Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5719935Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5720499Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5721701Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5722884Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5724031Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5725180Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5726334Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5727506Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5728670Z triton_flex_attention_backward_69 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5729822Z triton_flex_attention_backward_70 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5731030Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5732166Z triton_flex_attention_backward_75 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.5732461Z SingleProcess AUTOTUNE benchmarking takes 0.6601 seconds and 2.1469 seconds precompiling for 13 choices 2025-12-04T09:15:00.5732592Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.5732662Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.5732729Z unimplemented [] 2025-12-04T09:15:00.5732828Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.5733010Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.5734183Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.5734286Z graph_break [] 2025-12-04T09:15:00.5734422Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.5734488Z Autotune Choices Stats: 2025-12-04T09:15:00.5735865Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.5736106Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5736361Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5736677Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5737796Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5738929Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5740088Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.5741198Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.5742304Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5743441Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5743689Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2576 seconds precompiling for 6 choices 2025-12-04T09:15:00.5743756Z Autotune Choices Stats: 2025-12-04T09:15:00.5745221Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_84", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.5745653Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5745978Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5746542Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5747765Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5748910Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5750053Z triton_flex_attention_backward_83 0.0122 ms 92.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5751195Z triton_flex_attention_backward_82 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5752408Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5753579Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5754718Z triton_flex_attention_backward_88 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5755888Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5757062Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5758197Z triton_flex_attention_backward_91 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5758455Z SingleProcess AUTOTUNE benchmarking takes 0.6599 seconds and 2.2799 seconds precompiling for 13 choices 2025-12-04T09:15:00.5758588Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.5758657Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.5758727Z unimplemented [] 2025-12-04T09:15:00.5758827Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.5759011Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.5760232Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.5760296Z graph_break [] 2025-12-04T09:15:00.5760431Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.5760498Z Autotune Choices Stats: 2025-12-04T09:15:00.5762098Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:15:00.5762364Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5762596Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5762911Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5764110Z triton_flex_attention_99 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5765282Z triton_flex_attention_100 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5766399Z triton_flex_attention_95 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5767508Z triton_flex_attention_98 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.5768614Z triton_flex_attention_97 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.5769764Z triton_flex_attention_96 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5770016Z SingleProcess AUTOTUNE benchmarking takes 0.2905 seconds and 1.2994 seconds precompiling for 6 choices 2025-12-04T09:15:00.5770087Z Autotune Choices Stats: 2025-12-04T09:15:00.5771559Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_102", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.5772000Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5772375Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5773025Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5774193Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5775346Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5776503Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5777714Z triton_flex_attention_backward_101 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5778856Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5780048Z triton_flex_attention_backward_107 0.0133 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5781203Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5782420Z triton_flex_attention_backward_108 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5783565Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5784716Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5784966Z SingleProcess AUTOTUNE benchmarking takes 0.6615 seconds and 2.2434 seconds precompiling for 13 choices 2025-12-04T09:15:00.5785101Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.5785209Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.5785281Z unimplemented [] 2025-12-04T09:15:00.5785390Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.5785575Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.5786754Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.5786814Z graph_break [] 2025-12-04T09:15:00.5786948Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.5787020Z Autotune Choices Stats: 2025-12-04T09:15:00.5788445Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.5788689Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5788911Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5789261Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5790410Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5791518Z triton_flex_attention_119 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5792630Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.5793738Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.5794898Z triton_flex_attention_114 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5796039Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5796288Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2587 seconds precompiling for 6 choices 2025-12-04T09:15:00.5796354Z Autotune Choices Stats: 2025-12-04T09:15:00.5797781Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:15:00.5798280Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5798611Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5799172Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5800336Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5801481Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5802627Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5803809Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5804984Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5806139Z triton_flex_attention_backward_124 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5807370Z triton_flex_attention_backward_126 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5808515Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5809667Z triton_flex_attention_backward_129 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5810813Z triton_flex_attention_backward_132 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.5811099Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.3330 seconds precompiling for 13 choices 2025-12-04T09:15:00.5811227Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.5811297Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.5811365Z unimplemented [] 2025-12-04T09:15:00.5811466Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.5811648Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.5812906Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.5812973Z graph_break [] 2025-12-04T09:15:00.5813118Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.5813188Z Autotune Choices Stats: 2025-12-04T09:15:00.5814580Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:15:00.5814860Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5815115Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5815430Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5816552Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5817670Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5818787Z triton_flex_attention_136 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.5819942Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.5821048Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5822189Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5822435Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3488 seconds precompiling for 6 choices 2025-12-04T09:15:00.5822506Z Autotune Choices Stats: 2025-12-04T09:15:00.5823965Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_140", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.5824433Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5824768Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5825336Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5826502Z triton_flex_attention_backward_140 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5827652Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5828833Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5830016Z triton_flex_attention_backward_142 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5831165Z triton_flex_attention_backward_144 0.0131 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5832342Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5833511Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5834661Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5835804Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5837005Z triton_flex_attention_backward_149 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5837255Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.2392 seconds precompiling for 13 choices 2025-12-04T09:15:00.5837385Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.5837454Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.5837524Z unimplemented [] 2025-12-04T09:15:00.5837626Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.5837812Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.5839023Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.5839087Z graph_break [] 2025-12-04T09:15:00.5839219Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.5839288Z Autotune Choices Stats: 2025-12-04T09:15:00.5840733Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.5841005Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5841220Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5841538Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5842654Z triton_flex_attention_156 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5843766Z triton_flex_attention_157 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5844913Z triton_flex_attention_152 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5846018Z triton_flex_attention_154 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.5847159Z triton_flex_attention_155 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.5848261Z triton_flex_attention_153 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5848541Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3309 seconds precompiling for 6 choices 2025-12-04T09:15:00.5848607Z Autotune Choices Stats: 2025-12-04T09:15:00.5850077Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_160", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.5850509Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5850842Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5851407Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5852612Z triton_flex_attention_backward_160 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5853797Z triton_flex_attention_backward_158 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5854942Z triton_flex_attention_backward_159 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5856118Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5857580Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5858769Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5859923Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5861070Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5862363Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5863592Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.5863838Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.1546 seconds precompiling for 13 choices 2025-12-04T09:15:00.5863979Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.5864111Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.5864182Z unimplemented [] 2025-12-04T09:15:00.5864285Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.5864471Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.5865662Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.5865773Z graph_break [] 2025-12-04T09:15:00.5865908Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.5865974Z Autotune Choices Stats: 2025-12-04T09:15:00.5867411Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.5867654Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5867873Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5868212Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5869343Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5870454Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5871598Z triton_flex_attention_171 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5872753Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.5873869Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.5875005Z triton_flex_attention_172 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5875285Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3199 seconds precompiling for 6 choices 2025-12-04T09:15:00.5875352Z Autotune Choices Stats: 2025-12-04T09:15:00.5876785Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_178", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.5877217Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5877544Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5878102Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5879302Z triton_flex_attention_backward_178 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5880442Z triton_flex_attention_backward_177 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5881622Z triton_flex_attention_backward_180 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5882763Z triton_flex_attention_backward_179 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5884017Z triton_flex_attention_backward_182 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5885162Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5886308Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5887452Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5888630Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5889817Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.5890062Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3449 seconds precompiling for 13 choices 2025-12-04T09:15:00.5890199Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.5890270Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.5890341Z unimplemented [] 2025-12-04T09:15:00.5890441Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.5890628Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.5891844Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.5891952Z graph_break [] 2025-12-04T09:15:00.5892087Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.5892157Z Autotune Choices Stats: 2025-12-04T09:15:00.5893604Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:15:00.5893863Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5894078Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5894399Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5895518Z triton_flex_attention_194 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5896673Z triton_flex_attention_195 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5897815Z triton_flex_attention_192 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.5898928Z triton_flex_attention_193 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.5900082Z triton_flex_attention_190 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5901219Z triton_flex_attention_191 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5901470Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3094 seconds precompiling for 6 choices 2025-12-04T09:15:00.5901537Z Autotune Choices Stats: 2025-12-04T09:15:00.5902963Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.5903396Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5903723Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5904343Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5905510Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5906685Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5907834Z triton_flex_attention_backward_197 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5909009Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5910199Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5911349Z triton_flex_attention_backward_200 0.0124 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5912484Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5913664Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5914802Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.5915984Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5916229Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.3930 seconds precompiling for 13 choices 2025-12-04T09:15:00.5916361Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.5916428Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.5916529Z unimplemented [] 2025-12-04T09:15:00.5916629Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.5916844Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.5918029Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.5918092Z graph_break [] 2025-12-04T09:15:00.5918221Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.5918288Z Autotune Choices Stats: 2025-12-04T09:15:00.5919681Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008063999935984612, "best_triton_pos": 0} 2025-12-04T09:15:00.5919927Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5920140Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5920458Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5921622Z triton_flex_attention_213 0.0081 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5922730Z triton_flex_attention_214 0.0082 ms 98.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5923867Z triton_flex_attention_211 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.5924980Z triton_flex_attention_212 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.5926142Z triton_flex_attention_209 0.0112 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5927244Z triton_flex_attention_210 0.0133 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5927493Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.3148 seconds precompiling for 6 choices 2025-12-04T09:15:00.5927560Z Autotune Choices Stats: 2025-12-04T09:15:00.5928984Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.5929448Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5929776Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5930338Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5931534Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5932738Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5933932Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5935128Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5936279Z triton_flex_attention_backward_219 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5937433Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5938576Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5939810Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5941002Z triton_flex_attention_backward_224 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5942162Z triton_flex_attention_backward_227 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.5942445Z SingleProcess AUTOTUNE benchmarking takes 0.6619 seconds and 2.3746 seconds precompiling for 13 choices 2025-12-04T09:15:00.5942610Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.5942681Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.5942751Z unimplemented [] 2025-12-04T09:15:00.5942864Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.5943053Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.5944230Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.5944293Z graph_break [] 2025-12-04T09:15:00.5944435Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.5944501Z Autotune Choices Stats: 2025-12-04T09:15:00.5945885Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.5946162Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5946390Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5946705Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5947826Z triton_flex_attention_232 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5948964Z triton_flex_attention_233 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5950075Z triton_flex_attention_228 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5951249Z triton_flex_attention_230 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.5952362Z triton_flex_attention_231 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.5953473Z triton_flex_attention_229 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5953720Z SingleProcess AUTOTUNE benchmarking takes 0.2900 seconds and 1.2906 seconds precompiling for 6 choices 2025-12-04T09:15:00.5953788Z Autotune Choices Stats: 2025-12-04T09:15:00.5955228Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:15:00.5955693Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5956024Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5956585Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5957790Z triton_flex_attention_backward_234 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5958953Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5960199Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5961356Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5962709Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5963875Z triton_flex_attention_backward_240 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5965092Z triton_flex_attention_backward_241 0.0134 ms 91.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5966290Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5967442Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5968691Z triton_flex_attention_backward_246 0.0153 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.5968988Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2768 seconds precompiling for 13 choices 2025-12-04T09:15:00.5969116Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.5969187Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.5969254Z unimplemented [] 2025-12-04T09:15:00.5969355Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.5969546Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.5970745Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.5970807Z graph_break [] 2025-12-04T09:15:00.5970940Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.5971006Z Autotune Choices Stats: 2025-12-04T09:15:00.5972444Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.5972742Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5972964Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5973280Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5974449Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5975559Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5976697Z triton_flex_attention_250 0.0111 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.5977831Z triton_flex_attention_247 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5978945Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.5980058Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5980304Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2952 seconds precompiling for 6 choices 2025-12-04T09:15:00.5980406Z Autotune Choices Stats: 2025-12-04T09:15:00.5981839Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_255", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.5982269Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5982628Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5983192Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.5984355Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5985573Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5986729Z triton_flex_attention_backward_254 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5987894Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5989040Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5990222Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.5991399Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.5992553Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.5993727Z triton_flex_attention_backward_262 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.5994902Z triton_flex_attention_backward_265 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.5995149Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.3139 seconds precompiling for 13 choices 2025-12-04T09:15:00.5995281Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.5995352Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.5995420Z unimplemented [] 2025-12-04T09:15:00.5995523Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.5995707Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.5996884Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.5996945Z graph_break [] 2025-12-04T09:15:00.5997079Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.5997194Z Autotune Choices Stats: 2025-12-04T09:15:00.5998589Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.5998833Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.5999056Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.5999377Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6000556Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6001701Z triton_flex_attention_271 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6002842Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6003945Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.6005062Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.6006169Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6006452Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T09:15:00.6006519Z Autotune Choices Stats: 2025-12-04T09:15:00.6007942Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:15:00.6008410Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6008745Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6009312Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6010509Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6011687Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6012904Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6014060Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6015255Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6016406Z triton_flex_attention_backward_278 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6017586Z triton_flex_attention_backward_279 0.0136 ms 90.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6019114Z triton_flex_attention_backward_276 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6020395Z triton_flex_attention_backward_281 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6021550Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.6021818Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.2940 seconds precompiling for 13 choices 2025-12-04T09:15:00.6021956Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.6022029Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.6022097Z unimplemented [] 2025-12-04T09:15:00.6022202Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.6022387Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.6023568Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.6023667Z graph_break [] 2025-12-04T09:15:00.6023805Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.6023873Z Autotune Choices Stats: 2025-12-04T09:15:00.6025273Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.6025562Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6025788Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6026104Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6027239Z triton_flex_attention_289 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6028411Z triton_flex_attention_290 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6029531Z triton_flex_attention_287 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.6030643Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6031753Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.6032910Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6033176Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3003 seconds precompiling for 6 choices 2025-12-04T09:15:00.6033242Z Autotune Choices Stats: 2025-12-04T09:15:00.6034698Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_294", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.012160000391304493, "best_triton_pos": 0} 2025-12-04T09:15:00.6035133Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6035469Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6036096Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6037262Z triton_flex_attention_backward_294 0.0122 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6038410Z triton_flex_attention_backward_291 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6039568Z triton_flex_attention_backward_292 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6040712Z triton_flex_attention_backward_293 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6041895Z triton_flex_attention_backward_296 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6043076Z triton_flex_attention_backward_295 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6044224Z triton_flex_attention_backward_297 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6045433Z triton_flex_attention_backward_298 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6046570Z triton_flex_attention_backward_300 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6047722Z triton_flex_attention_backward_303 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.6047973Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.2523 seconds precompiling for 13 choices 2025-12-04T09:15:00.6048102Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.6048172Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.6048242Z unimplemented [] 2025-12-04T09:15:00.6048345Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.6048568Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.6049757Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.6049818Z graph_break [] 2025-12-04T09:15:00.6049961Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.6050030Z Autotune Choices Stats: 2025-12-04T09:15:00.6051453Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_308", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.6051701Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6051925Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6052242Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6053491Z triton_flex_attention_308 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6054634Z triton_flex_attention_309 0.0082 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6055757Z triton_flex_attention_307 0.0112 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.6056862Z triton_flex_attention_304 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6058015Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.6059122Z triton_flex_attention_305 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6059374Z SingleProcess AUTOTUNE benchmarking takes 0.2924 seconds and 1.4096 seconds precompiling for 6 choices 2025-12-04T09:15:00.6059476Z Autotune Choices Stats: 2025-12-04T09:15:00.6060904Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.6061334Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6061748Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6062515Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6063688Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6064840Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6065998Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6067214Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6068415Z triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6069577Z triton_flex_attention_backward_316 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6070762Z triton_flex_attention_backward_317 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6071963Z triton_flex_attention_backward_314 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6073121Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6074279Z triton_flex_attention_backward_318 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6074569Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.3225 seconds precompiling for 13 choices 2025-12-04T09:15:00.6074706Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.6074780Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.6074847Z unimplemented [] 2025-12-04T09:15:00.6074953Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.6075140Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.6076322Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.6076389Z graph_break [] 2025-12-04T09:15:00.6076525Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.6076628Z Autotune Choices Stats: 2025-12-04T09:15:00.6078026Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.007968000136315823, "best_triton_pos": 0} 2025-12-04T09:15:00.6078270Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6078522Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6078877Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6080011Z triton_flex_attention_327 0.0080 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6081132Z triton_flex_attention_328 0.0082 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6082253Z triton_flex_attention_325 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.6083363Z triton_flex_attention_326 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.6084508Z triton_flex_attention_323 0.0113 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6085650Z triton_flex_attention_324 0.0133 ms 59.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6085903Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T09:15:00.6085971Z Autotune Choices Stats: 2025-12-04T09:15:00.6087447Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.6087916Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6088246Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6088810Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6089993Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6091139Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6092434Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6093593Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6094787Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6095942Z triton_flex_attention_backward_336 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6097169Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6098316Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6099458Z triton_flex_attention_backward_338 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6100605Z triton_flex_attention_backward_341 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.6100898Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3668 seconds precompiling for 13 choices 2025-12-04T09:15:00.6101046Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.6101119Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.6101187Z unimplemented [] 2025-12-04T09:15:00.6101289Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.6101475Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.6102688Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.6102752Z graph_break [] 2025-12-04T09:15:00.6102883Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.6102951Z Autotune Choices Stats: 2025-12-04T09:15:00.6104373Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.6104648Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6104864Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6105186Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6106310Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6107419Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6108532Z triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6109678Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.6110826Z triton_flex_attention_345 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.6111937Z triton_flex_attention_343 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6112185Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2820 seconds precompiling for 6 choices 2025-12-04T09:15:00.6112262Z Autotune Choices Stats: 2025-12-04T09:15:00.6113759Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_349", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.6114190Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6114523Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6115092Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6116259Z triton_flex_attention_backward_349 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6117403Z triton_flex_attention_backward_348 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6118590Z triton_flex_attention_backward_350 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6119779Z triton_flex_attention_backward_351 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6120925Z triton_flex_attention_backward_353 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6122147Z triton_flex_attention_backward_355 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6123291Z triton_flex_attention_backward_352 0.0143 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6124441Z triton_flex_attention_backward_354 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6125580Z triton_flex_attention_backward_357 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6126765Z triton_flex_attention_backward_360 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.6127007Z SingleProcess AUTOTUNE benchmarking takes 0.6637 seconds and 2.2959 seconds precompiling for 13 choices 2025-12-04T09:15:00.6127144Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.6127215Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.6127288Z unimplemented [] 2025-12-04T09:15:00.6127391Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.6127627Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.6128808Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.6128870Z graph_break [] 2025-12-04T09:15:00.6129002Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.6129069Z Autotune Choices Stats: 2025-12-04T09:15:00.6130522Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.6130771Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6130987Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6131307Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6132481Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6133641Z triton_flex_attention_366 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6134807Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.6135931Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6137087Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.6138197Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6138524Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3270 seconds precompiling for 6 choices 2025-12-04T09:15:00.6138598Z Autotune Choices Stats: 2025-12-04T09:15:00.6140045Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_368", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.6140483Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6140814Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6141381Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6142548Z triton_flex_attention_backward_368 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6143732Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6144921Z triton_flex_attention_backward_369 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6146069Z triton_flex_attention_backward_370 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6147259Z triton_flex_attention_backward_372 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6148447Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6149619Z triton_flex_attention_backward_374 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6150770Z triton_flex_attention_backward_371 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6151962Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6153114Z triton_flex_attention_backward_379 0.0146 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.6153445Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.4637 seconds precompiling for 13 choices 2025-12-04T09:15:00.6153582Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.6153653Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.6153726Z unimplemented [] 2025-12-04T09:15:00.6153828Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.6154014Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.6155196Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.6155294Z graph_break [] 2025-12-04T09:15:00.6155459Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.6155526Z Autotune Choices Stats: 2025-12-04T09:15:00.6156913Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_384", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.6157161Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6157381Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6157700Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6158833Z triton_flex_attention_384 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6160016Z triton_flex_attention_385 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6161123Z triton_flex_attention_380 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6162444Z triton_flex_attention_382 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.6163611Z triton_flex_attention_383 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.6164836Z triton_flex_attention_381 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6165094Z SingleProcess AUTOTUNE benchmarking takes 0.2918 seconds and 1.3242 seconds precompiling for 6 choices 2025-12-04T09:15:00.6165160Z Autotune Choices Stats: 2025-12-04T09:15:00.6166604Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_386", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:15:00.6167040Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6167388Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6167979Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6169228Z triton_flex_attention_backward_386 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6170443Z triton_flex_attention_backward_387 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6171610Z triton_flex_attention_backward_388 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6172872Z triton_flex_attention_backward_389 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6174056Z triton_flex_attention_backward_391 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6175217Z triton_flex_attention_backward_392 0.0134 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6176378Z triton_flex_attention_backward_390 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6177548Z triton_flex_attention_backward_393 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6178946Z triton_flex_attention_backward_395 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6180151Z triton_flex_attention_backward_394 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6180415Z SingleProcess AUTOTUNE benchmarking takes 0.6645 seconds and 2.2524 seconds precompiling for 13 choices 2025-12-04T09:15:00.6180551Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.6180622Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.6180697Z unimplemented [] 2025-12-04T09:15:00.6180801Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.6180984Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.6182253Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.6182318Z graph_break [] 2025-12-04T09:15:00.6182453Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.6182523Z Autotune Choices Stats: 2025-12-04T09:15:00.6183931Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_403", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.6184178Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6184405Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6184722Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6185855Z triton_flex_attention_403 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6187015Z triton_flex_attention_404 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6188163Z triton_flex_attention_399 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6189287Z triton_flex_attention_401 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.6190438Z triton_flex_attention_402 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.6191602Z triton_flex_attention_400 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6191853Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3374 seconds precompiling for 6 choices 2025-12-04T09:15:00.6191924Z Autotune Choices Stats: 2025-12-04T09:15:00.6193361Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_406", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.6193836Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6194211Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6194782Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6195955Z triton_flex_attention_backward_406 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6197146Z triton_flex_attention_backward_405 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6198303Z triton_flex_attention_backward_407 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6199523Z triton_flex_attention_backward_408 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6200669Z triton_flex_attention_backward_410 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6201835Z triton_flex_attention_backward_409 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6202990Z triton_flex_attention_backward_411 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6204176Z triton_flex_attention_backward_412 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6205370Z triton_flex_attention_backward_413 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6206527Z triton_flex_attention_backward_414 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6206781Z SingleProcess AUTOTUNE benchmarking takes 0.6644 seconds and 2.4031 seconds precompiling for 13 choices 2025-12-04T09:15:00.6206948Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.6207019Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.6207091Z unimplemented [] 2025-12-04T09:15:00.6207228Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.6207416Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.6208641Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.6208702Z graph_break [] 2025-12-04T09:15:00.6208840Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.6208909Z Autotune Choices Stats: 2025-12-04T09:15:00.6210309Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_422", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.6210552Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6210776Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6211137Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6212277Z triton_flex_attention_422 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6213526Z triton_flex_attention_423 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6214659Z triton_flex_attention_420 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.6215808Z triton_flex_attention_421 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.6216961Z triton_flex_attention_418 0.0123 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6218089Z triton_flex_attention_419 0.0142 ms 57.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6218346Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3073 seconds precompiling for 6 choices 2025-12-04T09:15:00.6218412Z Autotune Choices Stats: 2025-12-04T09:15:00.6219843Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_426", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012256000190973282, "best_triton_pos": 0} 2025-12-04T09:15:00.6220349Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6220679Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6221242Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6222440Z triton_flex_attention_backward_426 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6223594Z triton_flex_attention_backward_424 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6224828Z triton_flex_attention_backward_425 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6225972Z triton_flex_attention_backward_427 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6227132Z triton_flex_attention_backward_429 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6228284Z triton_flex_attention_backward_431 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6229485Z triton_flex_attention_backward_430 0.0141 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6230678Z triton_flex_attention_backward_428 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6231870Z triton_flex_attention_backward_433 0.0143 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6233071Z triton_flex_attention_backward_436 0.0144 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.6233361Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.3392 seconds precompiling for 13 choices 2025-12-04T09:15:00.6233490Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.6233559Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.6233627Z unimplemented [] 2025-12-04T09:15:00.6233729Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.6233912Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.6235098Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.6235163Z graph_break [] 2025-12-04T09:15:00.6235298Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.6235364Z Autotune Choices Stats: 2025-12-04T09:15:00.6238443Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_441", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.6238820Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6239097Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6239435Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6240577Z triton_flex_attention_441 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6241755Z triton_flex_attention_442 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6242865Z triton_flex_attention_439 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.6244017Z triton_flex_attention_437 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6245137Z triton_flex_attention_440 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.6246249Z triton_flex_attention_438 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6246495Z SingleProcess AUTOTUNE benchmarking takes 0.2906 seconds and 1.4377 seconds precompiling for 6 choices 2025-12-04T09:15:00.6246574Z Autotune Choices Stats: 2025-12-04T09:15:00.6248079Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_445", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.6248564Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6248894Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6249501Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6250672Z triton_flex_attention_backward_445 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6251832Z triton_flex_attention_backward_446 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6253119Z triton_flex_attention_backward_443 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6254271Z triton_flex_attention_backward_444 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6255420Z triton_flex_attention_backward_447 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6256668Z triton_flex_attention_backward_448 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6257817Z triton_flex_attention_backward_450 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6258990Z triton_flex_attention_backward_449 0.0135 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6260143Z triton_flex_attention_backward_452 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6261324Z triton_flex_attention_backward_455 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.6261571Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.3278 seconds precompiling for 13 choices 2025-12-04T09:15:00.6261711Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.6261988Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.6262064Z unimplemented [] 2025-12-04T09:15:00.6262189Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.6262386Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.6263574Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.6263643Z graph_break [] 2025-12-04T09:15:00.6263778Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.6263852Z Autotune Choices Stats: 2025-12-04T09:15:00.6265366Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_460", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:15:00.6265677Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6265899Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6266224Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6267403Z triton_flex_attention_460 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6268521Z triton_flex_attention_461 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6269688Z triton_flex_attention_459 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.6270795Z triton_flex_attention_456 0.0114 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6271901Z triton_flex_attention_458 0.0123 ms 75.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.6273012Z triton_flex_attention_457 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6273326Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.4513 seconds precompiling for 6 choices 2025-12-04T09:15:00.6273402Z Autotune Choices Stats: 2025-12-04T09:15:00.6274821Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_463", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.6275257Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6275622Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6276190Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6277354Z triton_flex_attention_backward_463 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6278583Z triton_flex_attention_backward_465 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6279729Z triton_flex_attention_backward_462 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6280879Z triton_flex_attention_backward_464 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6282086Z triton_flex_attention_backward_467 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6283266Z triton_flex_attention_backward_468 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6284446Z triton_flex_attention_backward_469 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6285586Z triton_flex_attention_backward_466 0.0133 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6286734Z triton_flex_attention_backward_471 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6287915Z triton_flex_attention_backward_470 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6288160Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.2662 seconds precompiling for 13 choices 2025-12-04T09:15:00.6288296Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.6288371Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.6288432Z unimplemented [] 2025-12-04T09:15:00.6288539Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.6288722Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.6289904Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.6290074Z graph_break [] 2025-12-04T09:15:00.6290262Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.6290342Z Autotune Choices Stats: 2025-12-04T09:15:00.6291734Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_479", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.007968000136315823, "best_triton_pos": 0} 2025-12-04T09:15:00.6291985Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6292240Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6292623Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6293754Z triton_flex_attention_479 0.0080 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6294873Z triton_flex_attention_480 0.0082 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6296024Z triton_flex_attention_477 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.6297134Z triton_flex_attention_478 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.6298234Z triton_flex_attention_475 0.0113 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6299390Z triton_flex_attention_476 0.0133 ms 59.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6299670Z SingleProcess AUTOTUNE benchmarking takes 0.2917 seconds and 1.3266 seconds precompiling for 6 choices 2025-12-04T09:15:00.6299745Z Autotune Choices Stats: 2025-12-04T09:15:00.6301498Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_481", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.6301945Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6302270Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6302837Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6304035Z triton_flex_attention_backward_481 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6305194Z triton_flex_attention_backward_482 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6306341Z triton_flex_attention_backward_483 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6307526Z triton_flex_attention_backward_484 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6308720Z triton_flex_attention_backward_485 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6309903Z triton_flex_attention_backward_486 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6311048Z triton_flex_attention_backward_487 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6312191Z triton_flex_attention_backward_488 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6313374Z triton_flex_attention_backward_490 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6314521Z triton_flex_attention_backward_493 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.6314772Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2475 seconds precompiling for 13 choices 2025-12-04T09:15:00.6314905Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.6314985Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.6315051Z unimplemented [] 2025-12-04T09:15:00.6315163Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.6315349Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.6316563Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.6316665Z graph_break [] 2025-12-04T09:15:00.6316795Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.6316868Z Autotune Choices Stats: 2025-12-04T09:15:00.6318292Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_498", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.6318547Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6318765Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6319082Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6320221Z triton_flex_attention_498 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6321427Z triton_flex_attention_499 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6322534Z triton_flex_attention_494 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6323644Z triton_flex_attention_496 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.6324785Z triton_flex_attention_497 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.6325949Z triton_flex_attention_495 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6326192Z SingleProcess AUTOTUNE benchmarking takes 0.2916 seconds and 1.3152 seconds precompiling for 6 choices 2025-12-04T09:15:00.6326267Z Autotune Choices Stats: 2025-12-04T09:15:00.6327719Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_500", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:15:00.6328158Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6328483Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6329084Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6330248Z triton_flex_attention_backward_500 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6331408Z triton_flex_attention_backward_501 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6332659Z triton_flex_attention_backward_502 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6333931Z triton_flex_attention_backward_503 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6335082Z triton_flex_attention_backward_505 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6336403Z triton_flex_attention_backward_507 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6337557Z triton_flex_attention_backward_506 0.0143 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6338732Z triton_flex_attention_backward_504 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6339888Z triton_flex_attention_backward_509 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6341033Z triton_flex_attention_backward_512 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.6341280Z SingleProcess AUTOTUNE benchmarking takes 0.6641 seconds and 2.2733 seconds precompiling for 13 choices 2025-12-04T09:15:00.6341409Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.6341518Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.6341580Z unimplemented [] 2025-12-04T09:15:00.6341723Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.6341911Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.6343087Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 27), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 8), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 5), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.6343151Z graph_break [] 2025-12-04T09:15:00.6343277Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.6343350Z Autotune Choices Stats: 2025-12-04T09:15:00.6344776Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_517", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.6345023Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6345249Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6345604Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6346731Z triton_flex_attention_517 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6347841Z triton_flex_attention_518 0.0084 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6348950Z triton_flex_attention_513 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6350101Z triton_flex_attention_515 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.6351237Z triton_flex_attention_516 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.6352393Z triton_flex_attention_514 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6352639Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3311 seconds precompiling for 6 choices 2025-12-04T09:15:00.6352712Z Autotune Choices Stats: 2025-12-04T09:15:00.6354131Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_519", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:15:00.6354602Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6354921Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6355486Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6356644Z triton_flex_attention_backward_519 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6357801Z triton_flex_attention_backward_520 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6358981Z triton_flex_attention_backward_521 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6360206Z triton_flex_attention_backward_522 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6361382Z triton_flex_attention_backward_524 0.0132 ms 93.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6362701Z triton_flex_attention_backward_523 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6363926Z triton_flex_attention_backward_525 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6365068Z triton_flex_attention_backward_526 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6366218Z triton_flex_attention_backward_528 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6367420Z triton_flex_attention_backward_531 0.0153 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.6367734Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.2909 seconds precompiling for 13 choices 2025-12-04T09:15:00.6367867Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.6367943Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.6368007Z unimplemented [] 2025-12-04T09:15:00.6368149Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.6368347Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.6369524Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.6369644Z graph_break [] 2025-12-04T09:15:00.6369777Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.6369850Z Autotune Choices Stats: 2025-12-04T09:15:00.6371238Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_536", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.6371523Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6371742Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6372057Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6373252Z triton_flex_attention_536 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6374367Z triton_flex_attention_537 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6375473Z triton_flex_attention_532 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6376659Z triton_flex_attention_534 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.6377761Z triton_flex_attention_535 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.6378938Z triton_flex_attention_533 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6379181Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.2928 seconds precompiling for 6 choices 2025-12-04T09:15:00.6379257Z Autotune Choices Stats: 2025-12-04T09:15:00.6380675Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_538", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:15:00.6381149Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6381470Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6382036Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6383194Z triton_flex_attention_backward_538 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6384407Z triton_flex_attention_backward_539 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6385611Z triton_flex_attention_backward_540 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6386792Z triton_flex_attention_backward_541 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6387941Z triton_flex_attention_backward_542 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6389080Z triton_flex_attention_backward_543 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6390261Z triton_flex_attention_backward_545 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6391402Z triton_flex_attention_backward_544 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6392556Z triton_flex_attention_backward_547 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6393773Z triton_flex_attention_backward_550 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.6394022Z SingleProcess AUTOTUNE benchmarking takes 0.6681 seconds and 2.3456 seconds precompiling for 13 choices 2025-12-04T09:15:00.6394152Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.6394229Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.6394295Z unimplemented [] 2025-12-04T09:15:00.6394396Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.6394590Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.6395836Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.6395905Z graph_break [] 2025-12-04T09:15:00.6396032Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.6396110Z Autotune Choices Stats: 2025-12-04T09:15:00.6397499Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_555", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.6397797Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6398012Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6398324Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6399453Z triton_flex_attention_555 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6400547Z triton_flex_attention_556 0.0112 ms 91.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6401692Z triton_flex_attention_553 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.6402835Z triton_flex_attention_554 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.6403969Z triton_flex_attention_551 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6405079Z triton_flex_attention_552 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6405353Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.3230 seconds precompiling for 6 choices 2025-12-04T09:15:00.6405424Z Autotune Choices Stats: 2025-12-04T09:15:00.6406893Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_559", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.6407332Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6407663Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6408230Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6409403Z triton_flex_attention_backward_559 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6410646Z triton_flex_attention_backward_560 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6411796Z triton_flex_attention_backward_558 0.0142 ms 93.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6413038Z triton_flex_attention_backward_557 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6414207Z triton_flex_attention_backward_562 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6415383Z triton_flex_attention_backward_561 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6416534Z triton_flex_attention_backward_563 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6417687Z triton_flex_attention_backward_564 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6418925Z triton_flex_attention_backward_569 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.6420111Z triton_flex_attention_backward_566 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6420361Z SingleProcess AUTOTUNE benchmarking takes 0.6659 seconds and 2.2862 seconds precompiling for 13 choices 2025-12-04T09:15:00.6420494Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.6420570Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.6420664Z unimplemented [] 2025-12-04T09:15:00.6420769Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.6420972Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.6422148Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.6422215Z graph_break [] 2025-12-04T09:15:00.6422417Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.6422490Z Autotune Choices Stats: 2025-12-04T09:15:00.6423871Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_575", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.6424117Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6424334Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6424656Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6425817Z triton_flex_attention_575 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6426961Z triton_flex_attention_574 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6428113Z triton_flex_attention_572 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.6429264Z triton_flex_attention_573 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.6430371Z triton_flex_attention_570 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6431489Z triton_flex_attention_571 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6431768Z SingleProcess AUTOTUNE benchmarking takes 0.2930 seconds and 1.2848 seconds precompiling for 6 choices 2025-12-04T09:15:00.6431839Z Autotune Choices Stats: 2025-12-04T09:15:00.6433267Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_579", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.014271999709308147, "best_triton_pos": 0} 2025-12-04T09:15:00.6433707Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6434029Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6434629Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6435858Z triton_flex_attention_backward_579 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6437020Z triton_flex_attention_backward_577 0.0143 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6438197Z triton_flex_attention_backward_578 0.0143 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6439353Z triton_flex_attention_backward_576 0.0145 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6440534Z triton_flex_attention_backward_580 0.0154 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6441693Z triton_flex_attention_backward_581 0.0154 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6442853Z triton_flex_attention_backward_583 0.0155 ms 92.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6444038Z triton_flex_attention_backward_582 0.0164 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6445225Z triton_flex_attention_backward_585 0.0164 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6446438Z triton_flex_attention_backward_588 0.0164 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.6446691Z SingleProcess AUTOTUNE benchmarking takes 0.6683 seconds and 2.2637 seconds precompiling for 13 choices 2025-12-04T09:15:00.6446819Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.6446895Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.6453060Z unimplemented [] 2025-12-04T09:15:00.6453230Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.6453449Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.6454660Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.6454815Z graph_break [] 2025-12-04T09:15:00.6454960Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.6455033Z Autotune Choices Stats: 2025-12-04T09:15:00.6456452Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.6456711Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6456938Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6457258Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6458442Z triton_flex_attention_593 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6459603Z triton_flex_attention_594 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6460716Z triton_flex_attention_589 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6462149Z triton_flex_attention_591 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.6463351Z triton_flex_attention_592 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.6464540Z triton_flex_attention_590 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6464807Z SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.3283 seconds precompiling for 6 choices 2025-12-04T09:15:00.6464887Z Autotune Choices Stats: 2025-12-04T09:15:00.6466327Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_595", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:15:00.6466828Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6467161Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6467837Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6468999Z triton_flex_attention_backward_595 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6470194Z triton_flex_attention_backward_596 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6471348Z triton_flex_attention_backward_597 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6472503Z triton_flex_attention_backward_598 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6473685Z triton_flex_attention_backward_600 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6474830Z triton_flex_attention_backward_604 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6475974Z triton_flex_attention_backward_599 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6477237Z triton_flex_attention_backward_601 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6478380Z triton_flex_attention_backward_602 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6479561Z triton_flex_attention_backward_607 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.6479817Z SingleProcess AUTOTUNE benchmarking takes 0.6683 seconds and 2.3815 seconds precompiling for 13 choices 2025-12-04T09:15:00.6479961Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.6480040Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.6480107Z unimplemented [] 2025-12-04T09:15:00.6480254Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.6480447Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.6481635Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.6481697Z graph_break [] 2025-12-04T09:15:00.6481830Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.6481905Z Autotune Choices Stats: 2025-12-04T09:15:00.6483312Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_612", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.6483567Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6483790Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6484118Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6485326Z triton_flex_attention_612 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6486435Z triton_flex_attention_613 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6487574Z triton_flex_attention_610 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.6488679Z triton_flex_attention_608 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6489826Z triton_flex_attention_611 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.6490935Z triton_flex_attention_609 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6491187Z SingleProcess AUTOTUNE benchmarking takes 0.2948 seconds and 1.3336 seconds precompiling for 6 choices 2025-12-04T09:15:00.6491261Z Autotune Choices Stats: 2025-12-04T09:15:00.6492782Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_614", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:15:00.6493294Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6493625Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6494227Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6495432Z triton_flex_attention_backward_614 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6496591Z triton_flex_attention_backward_615 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6497742Z triton_flex_attention_backward_616 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6498919Z triton_flex_attention_backward_617 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6500078Z triton_flex_attention_backward_619 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6501220Z triton_flex_attention_backward_620 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6502437Z triton_flex_attention_backward_621 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6503607Z triton_flex_attention_backward_618 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6504791Z triton_flex_attention_backward_623 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6505939Z triton_flex_attention_backward_626 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.6506225Z SingleProcess AUTOTUNE benchmarking takes 0.6671 seconds and 2.3094 seconds precompiling for 13 choices 2025-12-04T09:15:00.6506362Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.6506434Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.6506498Z unimplemented [] 2025-12-04T09:15:00.6506609Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.6506844Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.6508021Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.6508097Z graph_break [] 2025-12-04T09:15:00.6508228Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.6508304Z Autotune Choices Stats: 2025-12-04T09:15:00.6509693Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_631", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.6509943Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6510238Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6510563Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6511677Z triton_flex_attention_631 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6512822Z triton_flex_attention_632 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6513930Z triton_flex_attention_627 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6515043Z triton_flex_attention_629 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.6516181Z triton_flex_attention_630 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.6517296Z triton_flex_attention_628 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6517539Z SingleProcess AUTOTUNE benchmarking takes 0.2939 seconds and 1.3035 seconds precompiling for 6 choices 2025-12-04T09:15:00.6517613Z Autotune Choices Stats: 2025-12-04T09:15:00.6519085Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_634", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.6519565Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6519891Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6520455Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6521655Z triton_flex_attention_backward_634 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6522811Z triton_flex_attention_backward_635 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6523992Z triton_flex_attention_backward_636 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6525143Z triton_flex_attention_backward_633 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6526294Z triton_flex_attention_backward_638 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6527508Z triton_flex_attention_backward_640 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6528701Z triton_flex_attention_backward_637 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6529940Z triton_flex_attention_backward_639 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6531094Z triton_flex_attention_backward_642 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6532239Z triton_flex_attention_backward_645 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.6532607Z SingleProcess AUTOTUNE benchmarking takes 0.6697 seconds and 2.2461 seconds precompiling for 13 choices 2025-12-04T09:15:00.6532744Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.6532813Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.6532875Z unimplemented [] 2025-12-04T09:15:00.6532984Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.6533173Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.6534358Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.6534430Z graph_break [] 2025-12-04T09:15:00.6534565Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.6534643Z Autotune Choices Stats: 2025-12-04T09:15:00.6536074Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_650", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.6536361Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6536580Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6536900Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6538058Z triton_flex_attention_650 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6539174Z triton_flex_attention_651 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6540303Z triton_flex_attention_648 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.6541445Z triton_flex_attention_646 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6542557Z triton_flex_attention_649 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.6543672Z triton_flex_attention_647 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6543918Z SingleProcess AUTOTUNE benchmarking takes 0.2938 seconds and 1.3188 seconds precompiling for 6 choices 2025-12-04T09:15:00.6544044Z Autotune Choices Stats: 2025-12-04T09:15:00.6545512Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_654", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.6545954Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6546313Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6546881Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6548048Z triton_flex_attention_backward_654 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6549260Z triton_flex_attention_backward_652 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6550398Z triton_flex_attention_backward_653 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6551547Z triton_flex_attention_backward_655 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6552694Z triton_flex_attention_backward_657 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6553910Z triton_flex_attention_backward_659 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6555060Z triton_flex_attention_backward_658 0.0156 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6556236Z triton_flex_attention_backward_656 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6557388Z triton_flex_attention_backward_661 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6558571Z triton_flex_attention_backward_664 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.6558820Z SingleProcess AUTOTUNE benchmarking takes 0.6665 seconds and 2.3325 seconds precompiling for 13 choices 2025-12-04T09:15:00.6558952Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.6559034Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.6559099Z unimplemented [] 2025-12-04T09:15:00.6559211Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.6559399Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.6560584Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.6560653Z graph_break [] 2025-12-04T09:15:00.6560782Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.6560895Z Autotune Choices Stats: 2025-12-04T09:15:00.6562502Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_669", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.6562776Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6562995Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6563320Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6564499Z triton_flex_attention_669 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6565619Z triton_flex_attention_670 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6566781Z triton_flex_attention_667 0.0123 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.6567901Z triton_flex_attention_665 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6569016Z triton_flex_attention_668 0.0135 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.6570175Z triton_flex_attention_666 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6570467Z SingleProcess AUTOTUNE benchmarking takes 0.2949 seconds and 1.4095 seconds precompiling for 6 choices 2025-12-04T09:15:00.6570544Z Autotune Choices Stats: 2025-12-04T09:15:00.6571967Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_673", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012512000277638435, "best_triton_pos": 0} 2025-12-04T09:15:00.6572491Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6572834Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6573406Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6574569Z triton_flex_attention_backward_673 0.0125 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6575755Z triton_flex_attention_backward_672 0.0133 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6576897Z triton_flex_attention_backward_674 0.0133 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6578047Z triton_flex_attention_backward_671 0.0143 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6579265Z triton_flex_attention_backward_676 0.0143 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6580422Z triton_flex_attention_backward_675 0.0152 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6581627Z triton_flex_attention_backward_677 0.0154 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6582773Z triton_flex_attention_backward_678 0.0154 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6583958Z triton_flex_attention_backward_680 0.0164 ms 76.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6585104Z triton_flex_attention_backward_683 0.0164 ms 76.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.6585352Z SingleProcess AUTOTUNE benchmarking takes 0.6665 seconds and 2.3098 seconds precompiling for 13 choices 2025-12-04T09:15:00.6585481Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.6585560Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.6585626Z unimplemented [] 2025-12-04T09:15:00.6585738Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.6585925Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.6587135Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.6587236Z graph_break [] 2025-12-04T09:15:00.6587364Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.6587445Z Autotune Choices Stats: 2025-12-04T09:15:00.6588836Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_688", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.6589125Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6589350Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6589663Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6590795Z triton_flex_attention_688 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6591949Z triton_flex_attention_689 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6593064Z triton_flex_attention_686 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.6594178Z triton_flex_attention_684 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6595286Z triton_flex_attention_687 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.6596480Z triton_flex_attention_685 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6596730Z SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.3382 seconds precompiling for 6 choices 2025-12-04T09:15:00.6596805Z Autotune Choices Stats: 2025-12-04T09:15:00.6598267Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_693", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.6598713Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6599035Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6599645Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6600802Z triton_flex_attention_backward_693 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6601954Z triton_flex_attention_backward_692 0.0135 ms 98.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6603099Z triton_flex_attention_backward_691 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6604278Z triton_flex_attention_backward_695 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6605464Z triton_flex_attention_backward_690 0.0153 ms 86.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6606637Z triton_flex_attention_backward_696 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6607791Z triton_flex_attention_backward_697 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6608933Z triton_flex_attention_backward_699 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6610121Z triton_flex_attention_backward_694 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6611286Z triton_flex_attention_backward_702 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.6611529Z SingleProcess AUTOTUNE benchmarking takes 0.6670 seconds and 2.3646 seconds precompiling for 13 choices 2025-12-04T09:15:00.6611658Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.6611733Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.6611795Z unimplemented [] 2025-12-04T09:15:00.6611904Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.6612128Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.6613398Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.6613469Z graph_break [] 2025-12-04T09:15:00.6613599Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.6613675Z Autotune Choices Stats: 2025-12-04T09:15:00.6615106Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_707", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.6615359Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6615574Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6615887Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6617022Z triton_flex_attention_707 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6618165Z triton_flex_attention_708 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6619281Z triton_flex_attention_703 0.0134 ms 76.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6620397Z triton_flex_attention_705 0.0135 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.6621577Z triton_flex_attention_706 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.6622694Z triton_flex_attention_704 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6622937Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3640 seconds precompiling for 6 choices 2025-12-04T09:15:00.6623010Z Autotune Choices Stats: 2025-12-04T09:15:00.6624469Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_712", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.6624911Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6625272Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6625836Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6627000Z triton_flex_attention_backward_712 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6628157Z triton_flex_attention_backward_710 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6629333Z triton_flex_attention_backward_711 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6630516Z triton_flex_attention_backward_709 0.0153 ms 86.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6631704Z triton_flex_attention_backward_714 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6632849Z triton_flex_attention_backward_716 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6634014Z triton_flex_attention_backward_713 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6635203Z triton_flex_attention_backward_715 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6636352Z triton_flex_attention_backward_718 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6637502Z triton_flex_attention_backward_721 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.6637745Z SingleProcess AUTOTUNE benchmarking takes 0.6684 seconds and 2.3297 seconds precompiling for 13 choices 2025-12-04T09:15:00.6637940Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.6638020Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.6638084Z unimplemented [] 2025-12-04T09:15:00.6638187Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.6638377Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.6639554Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.6639624Z graph_break [] 2025-12-04T09:15:00.6639750Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.6639824Z Autotune Choices Stats: 2025-12-04T09:15:00.6641243Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_726", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.6641498Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6641719Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6642070Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6643204Z triton_flex_attention_726 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6644323Z triton_flex_attention_727 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6645445Z triton_flex_attention_724 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.6646615Z triton_flex_attention_722 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6647759Z triton_flex_attention_725 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.6648910Z triton_flex_attention_723 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6649154Z SingleProcess AUTOTUNE benchmarking takes 0.2946 seconds and 1.3261 seconds precompiling for 6 choices 2025-12-04T09:15:00.6649225Z Autotune Choices Stats: 2025-12-04T09:15:00.6650652Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_730", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.6651122Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6651445Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6652008Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6653240Z triton_flex_attention_backward_730 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6654387Z triton_flex_attention_backward_729 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6655625Z triton_flex_attention_backward_731 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6656777Z triton_flex_attention_backward_728 0.0152 ms 87.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6657956Z triton_flex_attention_backward_733 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6659099Z triton_flex_attention_backward_732 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6660288Z triton_flex_attention_backward_734 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6661430Z triton_flex_attention_backward_735 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6662760Z triton_flex_attention_backward_737 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6663989Z triton_flex_attention_backward_736 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6664373Z SingleProcess AUTOTUNE benchmarking takes 0.6686 seconds and 2.3200 seconds precompiling for 13 choices 2025-12-04T09:15:00.6664511Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.6664584Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.6664648Z unimplemented [] 2025-12-04T09:15:00.6664752Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.6664942Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.6666170Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.6666244Z graph_break [] 2025-12-04T09:15:00.6666373Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.6666448Z Autotune Choices Stats: 2025-12-04T09:15:00.6667841Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_745", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.6668139Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6668360Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6668675Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6669808Z triton_flex_attention_745 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6670927Z triton_flex_attention_746 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6672080Z triton_flex_attention_743 0.0142 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.6673246Z triton_flex_attention_741 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6674395Z triton_flex_attention_744 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.6675513Z triton_flex_attention_742 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6675756Z SingleProcess AUTOTUNE benchmarking takes 0.2945 seconds and 1.2852 seconds precompiling for 6 choices 2025-12-04T09:15:00.6675830Z Autotune Choices Stats: 2025-12-04T09:15:00.6677256Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_748", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.6677747Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6678082Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6678659Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6679821Z triton_flex_attention_backward_748 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6681011Z triton_flex_attention_backward_749 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6682205Z triton_flex_attention_backward_747 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6683395Z triton_flex_attention_backward_750 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6684541Z triton_flex_attention_backward_752 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6685728Z triton_flex_attention_backward_751 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6686881Z triton_flex_attention_backward_753 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6688030Z triton_flex_attention_backward_754 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6689217Z triton_flex_attention_backward_756 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6690408Z triton_flex_attention_backward_759 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.6690662Z SingleProcess AUTOTUNE benchmarking takes 0.6747 seconds and 2.3463 seconds precompiling for 13 choices 2025-12-04T09:15:00.6690791Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.6690868Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.6690934Z unimplemented [] 2025-12-04T09:15:00.6691036Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.6691263Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.6692483Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.6692552Z graph_break [] 2025-12-04T09:15:00.6692686Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.6692756Z Autotune Choices Stats: 2025-12-04T09:15:00.6694156Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_764", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.6694443Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6694666Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6694985Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6696131Z triton_flex_attention_764 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6697239Z triton_flex_attention_765 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6698424Z triton_flex_attention_762 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.6699547Z triton_flex_attention_760 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6700694Z triton_flex_attention_763 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.6701811Z triton_flex_attention_761 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6702091Z SingleProcess AUTOTUNE benchmarking takes 0.2940 seconds and 1.3826 seconds precompiling for 6 choices 2025-12-04T09:15:00.6702167Z Autotune Choices Stats: 2025-12-04T09:15:00.6703590Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_769", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013376000337302685, "best_triton_pos": 0} 2025-12-04T09:15:00.6704032Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6704361Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6704933Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6706135Z triton_flex_attention_backward_769 0.0134 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6707339Z triton_flex_attention_backward_767 0.0143 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6708534Z triton_flex_attention_backward_768 0.0143 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6709692Z triton_flex_attention_backward_766 0.0154 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6710837Z triton_flex_attention_backward_770 0.0154 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6712028Z triton_flex_attention_backward_771 0.0154 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6713184Z triton_flex_attention_backward_772 0.0154 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6714332Z triton_flex_attention_backward_773 0.0154 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6715558Z triton_flex_attention_backward_775 0.0164 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6716705Z triton_flex_attention_backward_778 0.0164 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.6716996Z SingleProcess AUTOTUNE benchmarking takes 0.6673 seconds and 2.4135 seconds precompiling for 13 choices 2025-12-04T09:15:00.6717126Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.6717199Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.6717262Z unimplemented [] 2025-12-04T09:15:00.6717365Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.6717554Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.6718733Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.6718843Z graph_break [] 2025-12-04T09:15:00.6718972Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.6719038Z Autotune Choices Stats: 2025-12-04T09:15:00.6720430Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_783", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.6720673Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6720899Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6721215Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6722344Z triton_flex_attention_783 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6723534Z triton_flex_attention_784 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6724653Z triton_flex_attention_781 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.6725803Z triton_flex_attention_779 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6726912Z triton_flex_attention_782 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.6728065Z triton_flex_attention_780 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6728305Z SingleProcess AUTOTUNE benchmarking takes 0.2940 seconds and 1.3371 seconds precompiling for 6 choices 2025-12-04T09:15:00.6728377Z Autotune Choices Stats: 2025-12-04T09:15:00.6729810Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_787", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.6730251Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6730576Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6731141Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6732476Z triton_flex_attention_backward_787 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6733651Z triton_flex_attention_backward_785 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6734835Z triton_flex_attention_backward_786 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6735991Z triton_flex_attention_backward_788 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6737175Z triton_flex_attention_backward_790 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6738337Z triton_flex_attention_backward_791 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6739492Z triton_flex_attention_backward_789 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6740677Z triton_flex_attention_backward_792 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6741878Z triton_flex_attention_backward_794 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6743060Z triton_flex_attention_backward_797 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.6743317Z SingleProcess AUTOTUNE benchmarking takes 0.6666 seconds and 2.3976 seconds precompiling for 13 choices 2025-12-04T09:15:00.6743445Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.6743521Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.6743585Z unimplemented [] 2025-12-04T09:15:00.6743728Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.6743921Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.6745168Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.6745235Z graph_break [] 2025-12-04T09:15:00.6745361Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.6745428Z Autotune Choices Stats: 2025-12-04T09:15:00.6746825Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_803", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.6747073Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6747296Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6747612Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6748773Z triton_flex_attention_803 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6749922Z triton_flex_attention_802 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6751357Z triton_flex_attention_801 0.0143 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.6752491Z triton_flex_attention_798 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6753607Z triton_flex_attention_800 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.6754765Z triton_flex_attention_799 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6755009Z SingleProcess AUTOTUNE benchmarking takes 0.2967 seconds and 1.3426 seconds precompiling for 6 choices 2025-12-04T09:15:00.6755089Z Autotune Choices Stats: 2025-12-04T09:15:00.6756509Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_806", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.6756951Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6757346Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6757914Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6759077Z triton_flex_attention_backward_806 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6760287Z triton_flex_attention_backward_807 0.0142 ms 93.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6761432Z triton_flex_attention_backward_804 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6762816Z triton_flex_attention_backward_805 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6763970Z triton_flex_attention_backward_809 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6765131Z triton_flex_attention_backward_810 0.0163 ms 81.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6766355Z triton_flex_attention_backward_813 0.0163 ms 81.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6767552Z triton_flex_attention_backward_808 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6768763Z triton_flex_attention_backward_811 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6769923Z triton_flex_attention_backward_816 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.6770174Z SingleProcess AUTOTUNE benchmarking takes 0.6691 seconds and 2.4016 seconds precompiling for 13 choices 2025-12-04T09:15:00.6770306Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.6770435Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.6770499Z unimplemented [] 2025-12-04T09:15:00.6770602Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.6770790Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.6771978Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.6772045Z graph_break [] 2025-12-04T09:15:00.6772172Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.6772242Z Autotune Choices Stats: 2025-12-04T09:15:00.6773698Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_821", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.6773939Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6774162Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6774576Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6775707Z triton_flex_attention_821 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6776870Z triton_flex_attention_822 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6778032Z triton_flex_attention_819 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.6779149Z triton_flex_attention_820 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.6780314Z triton_flex_attention_817 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6781436Z triton_flex_attention_818 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6781680Z SingleProcess AUTOTUNE benchmarking takes 0.2943 seconds and 1.3304 seconds precompiling for 6 choices 2025-12-04T09:15:00.6781752Z Autotune Choices Stats: 2025-12-04T09:15:00.6783177Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_826", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.6783685Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6784007Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6784569Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6785772Z triton_flex_attention_backward_826 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6786928Z triton_flex_attention_backward_823 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6788079Z triton_flex_attention_backward_824 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6789264Z triton_flex_attention_backward_825 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6790411Z triton_flex_attention_backward_828 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6791562Z triton_flex_attention_backward_829 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6792815Z triton_flex_attention_backward_830 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6793963Z triton_flex_attention_backward_827 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6795149Z triton_flex_attention_backward_835 0.0173 ms 76.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.6796298Z triton_flex_attention_backward_831 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6796581Z SingleProcess AUTOTUNE benchmarking takes 0.6677 seconds and 2.3188 seconds precompiling for 13 choices 2025-12-04T09:15:00.6796707Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.6796781Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.6796845Z unimplemented [] 2025-12-04T09:15:00.6796949Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.6797136Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.6798310Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.6798394Z graph_break [] 2025-12-04T09:15:00.6798526Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.6798594Z Autotune Choices Stats: 2025-12-04T09:15:00.6799993Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_840", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.6800398Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6800626Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6800942Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6802068Z triton_flex_attention_840 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6803217Z triton_flex_attention_841 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6804336Z triton_flex_attention_838 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.6805496Z triton_flex_attention_836 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6806608Z triton_flex_attention_839 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.6807725Z triton_flex_attention_837 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6807970Z SingleProcess AUTOTUNE benchmarking takes 0.2942 seconds and 1.3626 seconds precompiling for 6 choices 2025-12-04T09:15:00.6808038Z Autotune Choices Stats: 2025-12-04T09:15:00.6809510Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_844", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.6809984Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6810308Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6810907Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6812073Z triton_flex_attention_backward_844 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6813326Z triton_flex_attention_backward_845 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6814538Z triton_flex_attention_backward_842 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6815705Z triton_flex_attention_backward_843 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6816856Z triton_flex_attention_backward_846 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6818043Z triton_flex_attention_backward_847 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6819229Z triton_flex_attention_backward_848 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6820417Z triton_flex_attention_backward_851 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6821570Z triton_flex_attention_backward_849 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6822751Z triton_flex_attention_backward_854 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.6823001Z SingleProcess AUTOTUNE benchmarking takes 0.6694 seconds and 2.3460 seconds precompiling for 13 choices 2025-12-04T09:15:00.6823134Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.6823212Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.6823276Z unimplemented [] 2025-12-04T09:15:00.6823380Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.6823577Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.6824754Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.6824821Z graph_break [] 2025-12-04T09:15:00.6824959Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.6825029Z Autotune Choices Stats: 2025-12-04T09:15:00.6826463Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_859", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.6826743Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6826973Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6827290Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6828457Z triton_flex_attention_859 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6829617Z triton_flex_attention_860 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6830772Z triton_flex_attention_855 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6831884Z triton_flex_attention_857 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.6833002Z triton_flex_attention_858 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.6834114Z triton_flex_attention_856 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6834436Z SingleProcess AUTOTUNE benchmarking takes 0.2942 seconds and 1.3732 seconds precompiling for 6 choices 2025-12-04T09:15:00.6834512Z Autotune Choices Stats: 2025-12-04T09:15:00.6835945Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_862", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.6836377Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6836760Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6837322Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6838495Z triton_flex_attention_backward_862 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6839689Z triton_flex_attention_backward_864 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6840839Z triton_flex_attention_backward_861 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6841989Z triton_flex_attention_backward_863 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6843164Z triton_flex_attention_backward_866 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6844340Z triton_flex_attention_backward_865 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6845518Z triton_flex_attention_backward_867 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6846671Z triton_flex_attention_backward_868 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6847821Z triton_flex_attention_backward_870 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6849001Z triton_flex_attention_backward_873 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.6849251Z SingleProcess AUTOTUNE benchmarking takes 0.6688 seconds and 2.4237 seconds precompiling for 13 choices 2025-12-04T09:15:00.6849393Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.6849468Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.6849530Z unimplemented [] 2025-12-04T09:15:00.6849631Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.6849823Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.6850998Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.6851104Z graph_break [] 2025-12-04T09:15:00.6851263Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.6851335Z Autotune Choices Stats: 2025-12-04T09:15:00.6852781Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_878", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.6853024Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6853252Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6853608Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6854747Z triton_flex_attention_878 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6855853Z triton_flex_attention_879 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6857005Z triton_flex_attention_874 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6858118Z triton_flex_attention_876 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.6859238Z triton_flex_attention_877 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.6860389Z triton_flex_attention_875 0.0145 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6860663Z SingleProcess AUTOTUNE benchmarking takes 0.2942 seconds and 1.4549 seconds precompiling for 6 choices 2025-12-04T09:15:00.6860730Z Autotune Choices Stats: 2025-12-04T09:15:00.6862371Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_881", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.6862817Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6863146Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6863710Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6864927Z triton_flex_attention_backward_881 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6866075Z triton_flex_attention_backward_880 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6867232Z triton_flex_attention_backward_882 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6868440Z triton_flex_attention_backward_883 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6869658Z triton_flex_attention_backward_885 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6870848Z triton_flex_attention_backward_884 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6872003Z triton_flex_attention_backward_886 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6873156Z triton_flex_attention_backward_887 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6874346Z triton_flex_attention_backward_889 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6875490Z triton_flex_attention_backward_892 0.0172 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.6875747Z SingleProcess AUTOTUNE benchmarking takes 0.6678 seconds and 2.4344 seconds precompiling for 13 choices 2025-12-04T09:15:00.6875878Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.6875961Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.6876037Z unimplemented [] 2025-12-04T09:15:00.6876140Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.6876330Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.6877542Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.6877649Z graph_break [] 2025-12-04T09:15:00.6877778Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.6877858Z Autotune Choices Stats: 2025-12-04T09:15:00.6879286Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_897", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.6879532Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6879768Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6880084Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6881229Z triton_flex_attention_897 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6882386Z triton_flex_attention_898 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6883501Z triton_flex_attention_893 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6884615Z triton_flex_attention_895 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.6885767Z triton_flex_attention_896 0.0142 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.6886920Z triton_flex_attention_894 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6887164Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3455 seconds precompiling for 6 choices 2025-12-04T09:15:00.6887232Z Autotune Choices Stats: 2025-12-04T09:15:00.6888696Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_899", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.6889129Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6889469Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6890071Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6891244Z triton_flex_attention_backward_899 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6892438Z triton_flex_attention_backward_900 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6893599Z triton_flex_attention_backward_901 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6894822Z triton_flex_attention_backward_902 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6895965Z triton_flex_attention_backward_904 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6897153Z triton_flex_attention_backward_903 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6898299Z triton_flex_attention_backward_905 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6899485Z triton_flex_attention_backward_906 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6900645Z triton_flex_attention_backward_907 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6901795Z triton_flex_attention_backward_908 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6902044Z SingleProcess AUTOTUNE benchmarking takes 0.6671 seconds and 2.2860 seconds precompiling for 13 choices 2025-12-04T09:15:00.6902215Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:15:00.6902348Z Traceback (most recent call last): 2025-12-04T09:15:00.6902692Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:15:00.6902762Z self.assertTrue( 2025-12-04T09:15:00.6902966Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:15:00.6903048Z raise self.failureException(msg) 2025-12-04T09:15:00.6903288Z AssertionError: False is not true : Log file /tmp/tmp8zz3dbsv/flex_attention_configs.json was not created 2025-12-04T09:15:00.6903293Z 2025-12-04T09:15:00.6903432Z To execute this test, run the following from the base repo dir: 2025-12-04T09:15:00.6903684Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:15:00.6903688Z 2025-12-04T09:15:00.6903856Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:15:00.6903987Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.6904061Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.6904129Z unimplemented [] 2025-12-04T09:15:00.6904269Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.6905453Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('async_compile_cache_miss', 23), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:15:00.6905642Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.6905706Z graph_break [] 2025-12-04T09:15:00.6905837Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.6906865Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:15:00.6906964Z current_size = base.storage().size() 2025-12-04T09:15:00.6907035Z Autotune Choices Stats: 2025-12-04T09:15:00.6908434Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.6908689Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6908908Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6909234Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6910367Z triton_flex_attention_4 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6911550Z triton_flex_attention_5 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6912659Z triton_flex_attention_0 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6913805Z triton_flex_attention_2 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.6914925Z triton_flex_attention_3 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.6916065Z triton_flex_attention_1 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6916318Z SingleProcess AUTOTUNE benchmarking takes 0.2600 seconds and 1.4244 seconds precompiling for 6 choices 2025-12-04T09:15:00.6916385Z Autotune Choices Stats: 2025-12-04T09:15:00.6917811Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.6918242Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6918571Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6919207Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6920376Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6921559Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6922708Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6923851Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6925035Z triton_flex_attention_backward_11 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6926188Z triton_flex_attention_backward_12 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6927333Z triton_flex_attention_backward_13 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6928549Z triton_flex_attention_backward_10 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6929701Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6930902Z triton_flex_attention_backward_15 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6931153Z SingleProcess AUTOTUNE benchmarking takes 0.8717 seconds and 2.2315 seconds precompiling for 13 choices 2025-12-04T09:15:00.6931287Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.6931359Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.6931430Z unimplemented [] 2025-12-04T09:15:00.6931536Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.6931818Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.6933058Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.6933122Z graph_break [] 2025-12-04T09:15:00.6933261Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.6933328Z Autotune Choices Stats: 2025-12-04T09:15:00.6934717Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.6934975Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6935195Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6935521Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6936732Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6937851Z triton_flex_attention_24 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6938995Z triton_flex_attention_19 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6940123Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.6941288Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.6942403Z triton_flex_attention_20 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6942658Z SingleProcess AUTOTUNE benchmarking takes 0.2898 seconds and 1.2778 seconds precompiling for 6 choices 2025-12-04T09:15:00.6942726Z Autotune Choices Stats: 2025-12-04T09:15:00.6944165Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.6944596Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6944995Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6945561Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6946746Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6947930Z triton_flex_attention_backward_27 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6949081Z triton_flex_attention_backward_28 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6950259Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6951413Z triton_flex_attention_backward_30 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6952564Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6953740Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6954916Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6956094Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6957242Z triton_flex_attention_backward_37 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.6957495Z SingleProcess AUTOTUNE benchmarking takes 0.6605 seconds and 2.3199 seconds precompiling for 13 choices 2025-12-04T09:15:00.6957666Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.6957737Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.6957800Z unimplemented [] 2025-12-04T09:15:00.6957908Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.6958090Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.6959265Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.6959328Z graph_break [] 2025-12-04T09:15:00.6959470Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.6959541Z Autotune Choices Stats: 2025-12-04T09:15:00.6960914Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.6961159Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6961445Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6961940Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6963081Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6964300Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6965435Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.6966559Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.6967720Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6968834Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6969084Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2616 seconds precompiling for 6 choices 2025-12-04T09:15:00.6969150Z Autotune Choices Stats: 2025-12-04T09:15:00.6970637Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.6971115Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6971443Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6972005Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6973293Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6974449Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6975646Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6976797Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6977952Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6979127Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.6980313Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.6981502Z triton_flex_attention_backward_50 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.6982646Z triton_flex_attention_backward_53 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6983801Z triton_flex_attention_backward_56 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.6984081Z SingleProcess AUTOTUNE benchmarking takes 0.6631 seconds and 2.2635 seconds precompiling for 13 choices 2025-12-04T09:15:00.6984228Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.6984302Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.6984366Z unimplemented [] 2025-12-04T09:15:00.6984476Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.6984661Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.6985846Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.6985913Z graph_break [] 2025-12-04T09:15:00.6986041Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.6986121Z Autotune Choices Stats: 2025-12-04T09:15:00.6987542Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.6987825Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6988044Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6988369Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6989525Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6990643Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6991756Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.6992907Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.6994021Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6995136Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.6995385Z SingleProcess AUTOTUNE benchmarking takes 0.2907 seconds and 1.2500 seconds precompiling for 6 choices 2025-12-04T09:15:00.6995452Z Autotune Choices Stats: 2025-12-04T09:15:00.6996963Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.6997405Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.6997732Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.6998350Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.6999519Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7000690Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7001886Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7003044Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7004197Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7005410Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7006568Z triton_flex_attention_backward_69 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7007750Z triton_flex_attention_backward_70 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7008894Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7010077Z triton_flex_attention_backward_75 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.7010320Z SingleProcess AUTOTUNE benchmarking takes 0.6601 seconds and 2.1469 seconds precompiling for 13 choices 2025-12-04T09:15:00.7010456Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.7010525Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.7010592Z unimplemented [] 2025-12-04T09:15:00.7010716Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.7010905Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.7012080Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.7012139Z graph_break [] 2025-12-04T09:15:00.7012263Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.7012381Z Autotune Choices Stats: 2025-12-04T09:15:00.7013866Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.7014120Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7014337Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7014664Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7015836Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7016954Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7018101Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.7019211Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.7020315Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7021427Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7021746Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2576 seconds precompiling for 6 choices 2025-12-04T09:15:00.7021818Z Autotune Choices Stats: 2025-12-04T09:15:00.7023237Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_84", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.7023678Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7024039Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7024617Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7025783Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7026980Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7028129Z triton_flex_attention_backward_83 0.0122 ms 92.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7029275Z triton_flex_attention_backward_82 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7030501Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7031690Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7032866Z triton_flex_attention_backward_88 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7034016Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7035194Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7036342Z triton_flex_attention_backward_91 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7036605Z SingleProcess AUTOTUNE benchmarking takes 0.6599 seconds and 2.2799 seconds precompiling for 13 choices 2025-12-04T09:15:00.7036742Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.7036814Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.7036882Z unimplemented [] 2025-12-04T09:15:00.7036988Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.7037177Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.7038362Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.7038490Z graph_break [] 2025-12-04T09:15:00.7038629Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.7038696Z Autotune Choices Stats: 2025-12-04T09:15:00.7040097Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:15:00.7040342Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7040595Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7040923Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7042047Z triton_flex_attention_99 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7043170Z triton_flex_attention_100 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7044328Z triton_flex_attention_95 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7045438Z triton_flex_attention_98 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.7046554Z triton_flex_attention_97 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.7047786Z triton_flex_attention_96 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7048040Z SingleProcess AUTOTUNE benchmarking takes 0.2905 seconds and 1.2994 seconds precompiling for 6 choices 2025-12-04T09:15:00.7048107Z Autotune Choices Stats: 2025-12-04T09:15:00.7049566Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_102", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.7050004Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7050334Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7050900Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7052112Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7053351Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7054516Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7055706Z triton_flex_attention_backward_101 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7056901Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7058087Z triton_flex_attention_backward_107 0.0133 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7059232Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7060396Z triton_flex_attention_backward_108 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7061591Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7062918Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7063170Z SingleProcess AUTOTUNE benchmarking takes 0.6615 seconds and 2.2434 seconds precompiling for 13 choices 2025-12-04T09:15:00.7063307Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.7063378Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.7063451Z unimplemented [] 2025-12-04T09:15:00.7063557Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.7063743Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.7065034Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.7065099Z graph_break [] 2025-12-04T09:15:00.7065235Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.7065303Z Autotune Choices Stats: 2025-12-04T09:15:00.7066781Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.7067038Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7067258Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7067582Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7068706Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7069891Z triton_flex_attention_119 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7071027Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.7072151Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.7073305Z triton_flex_attention_114 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7074452Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7074702Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2587 seconds precompiling for 6 choices 2025-12-04T09:15:00.7074773Z Autotune Choices Stats: 2025-12-04T09:15:00.7076232Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:15:00.7076667Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7077000Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7077597Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7078761Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7079919Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7081073Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7082288Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7083447Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7084645Z triton_flex_attention_backward_124 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7085793Z triton_flex_attention_backward_126 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7086976Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7088121Z triton_flex_attention_backward_129 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7089276Z triton_flex_attention_backward_132 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.7089521Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.3330 seconds precompiling for 13 choices 2025-12-04T09:15:00.7089662Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.7089768Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.7089867Z unimplemented [] 2025-12-04T09:15:00.7089972Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.7090165Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.7091346Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.7091406Z graph_break [] 2025-12-04T09:15:00.7091539Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.7091608Z Autotune Choices Stats: 2025-12-04T09:15:00.7093124Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:15:00.7093377Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7093598Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7093956Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7095085Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7096204Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7097396Z triton_flex_attention_136 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.7098564Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.7099758Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7100906Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7101159Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3488 seconds precompiling for 6 choices 2025-12-04T09:15:00.7101225Z Autotune Choices Stats: 2025-12-04T09:15:00.7102653Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_140", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.7103124Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7103463Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7104027Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7105217Z triton_flex_attention_backward_140 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7106382Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7107574Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7108757Z triton_flex_attention_backward_142 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7109938Z triton_flex_attention_backward_144 0.0131 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7111087Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7112269Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7113421Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7114574Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7115763Z triton_flex_attention_backward_149 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7116042Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.2392 seconds precompiling for 13 choices 2025-12-04T09:15:00.7116176Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.7116244Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.7116311Z unimplemented [] 2025-12-04T09:15:00.7116416Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.7116600Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.7117807Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.7117872Z graph_break [] 2025-12-04T09:15:00.7118005Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.7118072Z Autotune Choices Stats: 2025-12-04T09:15:00.7119460Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.7119747Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7119966Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7120287Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7121409Z triton_flex_attention_156 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7122531Z triton_flex_attention_157 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7123649Z triton_flex_attention_152 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7124857Z triton_flex_attention_154 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.7125972Z triton_flex_attention_155 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.7127117Z triton_flex_attention_153 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7127368Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3309 seconds precompiling for 6 choices 2025-12-04T09:15:00.7127434Z Autotune Choices Stats: 2025-12-04T09:15:00.7128861Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_160", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.7129325Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7129663Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7130244Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7131413Z triton_flex_attention_backward_160 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7132653Z triton_flex_attention_backward_158 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7133851Z triton_flex_attention_backward_159 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7135028Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7136187Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7137342Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7138522Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7139670Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7140849Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7142067Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.7142312Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.1546 seconds precompiling for 13 choices 2025-12-04T09:15:00.7142447Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.7142515Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.7142578Z unimplemented [] 2025-12-04T09:15:00.7142683Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.7142871Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.7144085Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.7144147Z graph_break [] 2025-12-04T09:15:00.7144279Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.7144345Z Autotune Choices Stats: 2025-12-04T09:15:00.7145726Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.7146006Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7146221Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7146544Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7147671Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7148797Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7149980Z triton_flex_attention_171 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7151101Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.7152295Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.7153439Z triton_flex_attention_172 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7153723Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3199 seconds precompiling for 6 choices 2025-12-04T09:15:00.7153790Z Autotune Choices Stats: 2025-12-04T09:15:00.7155221Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_178", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.7155651Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7155989Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7156561Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7157785Z triton_flex_attention_backward_178 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7158969Z triton_flex_attention_backward_177 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7160156Z triton_flex_attention_backward_180 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7161303Z triton_flex_attention_backward_179 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7162639Z triton_flex_attention_backward_182 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7163878Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7165051Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7166212Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7167413Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7168613Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.7168860Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3449 seconds precompiling for 13 choices 2025-12-04T09:15:00.7169004Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.7169121Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.7169188Z unimplemented [] 2025-12-04T09:15:00.7169302Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.7169490Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.7170670Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.7170769Z graph_break [] 2025-12-04T09:15:00.7170905Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.7170974Z Autotune Choices Stats: 2025-12-04T09:15:00.7172443Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:15:00.7172713Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7172934Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7173258Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7174382Z triton_flex_attention_194 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7175538Z triton_flex_attention_195 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7176686Z triton_flex_attention_192 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.7177840Z triton_flex_attention_193 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.7178960Z triton_flex_attention_190 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7180072Z triton_flex_attention_191 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7180354Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3094 seconds precompiling for 6 choices 2025-12-04T09:15:00.7180421Z Autotune Choices Stats: 2025-12-04T09:15:00.7181844Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.7182287Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7182612Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7183183Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7184419Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7185589Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7186781Z triton_flex_attention_backward_197 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7187929Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7189131Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7190278Z triton_flex_attention_backward_200 0.0124 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7191436Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7192668Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7193844Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.7195248Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7195504Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.3930 seconds precompiling for 13 choices 2025-12-04T09:15:00.7195641Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.7195710Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.7195773Z unimplemented [] 2025-12-04T09:15:00.7195884Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.7196081Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.7197265Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.7197367Z graph_break [] 2025-12-04T09:15:00.7197493Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.7197567Z Autotune Choices Stats: 2025-12-04T09:15:00.7198961Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008063999935984612, "best_triton_pos": 0} 2025-12-04T09:15:00.7199209Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7199431Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7199753Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7200946Z triton_flex_attention_213 0.0081 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7202165Z triton_flex_attention_214 0.0082 ms 98.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7203315Z triton_flex_attention_211 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.7204439Z triton_flex_attention_212 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.7205551Z triton_flex_attention_209 0.0112 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7206694Z triton_flex_attention_210 0.0133 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7206942Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.3148 seconds precompiling for 6 choices 2025-12-04T09:15:00.7207008Z Autotune Choices Stats: 2025-12-04T09:15:00.7208427Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.7208864Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7209184Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7209834Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7211034Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7212239Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7213450Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7214606Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7215794Z triton_flex_attention_backward_219 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7216947Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7218098Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7219317Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7220468Z triton_flex_attention_backward_224 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7221678Z triton_flex_attention_backward_227 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.7221935Z SingleProcess AUTOTUNE benchmarking takes 0.6619 seconds and 2.3746 seconds precompiling for 13 choices 2025-12-04T09:15:00.7222071Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.7222141Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.7222240Z unimplemented [] 2025-12-04T09:15:00.7222350Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.7222537Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.7223716Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.7223778Z graph_break [] 2025-12-04T09:15:00.7223904Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.7223977Z Autotune Choices Stats: 2025-12-04T09:15:00.7225371Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.7225619Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7225835Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7226158Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7227350Z triton_flex_attention_232 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7228476Z triton_flex_attention_233 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7229633Z triton_flex_attention_228 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7230757Z triton_flex_attention_230 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.7231905Z triton_flex_attention_231 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.7233024Z triton_flex_attention_229 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7233279Z SingleProcess AUTOTUNE benchmarking takes 0.2900 seconds and 1.2906 seconds precompiling for 6 choices 2025-12-04T09:15:00.7233346Z Autotune Choices Stats: 2025-12-04T09:15:00.7234764Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:15:00.7235268Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7235593Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7236161Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7237359Z triton_flex_attention_backward_234 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7238517Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7239685Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7240882Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7242038Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7243189Z triton_flex_attention_backward_240 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7244408Z triton_flex_attention_backward_241 0.0134 ms 91.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7245563Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7246742Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7247897Z triton_flex_attention_backward_246 0.0153 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.7248179Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2768 seconds precompiling for 13 choices 2025-12-04T09:15:00.7248315Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.7248385Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.7248448Z unimplemented [] 2025-12-04T09:15:00.7248555Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.7248740Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.7249934Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.7250001Z graph_break [] 2025-12-04T09:15:00.7250132Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.7250207Z Autotune Choices Stats: 2025-12-04T09:15:00.7251590Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.7251876Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7252142Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7252512Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7253652Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7254811Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7255926Z triton_flex_attention_250 0.0111 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.7257080Z triton_flex_attention_247 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7258188Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.7259308Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7259565Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2952 seconds precompiling for 6 choices 2025-12-04T09:15:00.7259633Z Autotune Choices Stats: 2025-12-04T09:15:00.7261085Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_255", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.7261555Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7262023Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7262593Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7263812Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7264984Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7266204Z triton_flex_attention_backward_254 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7267353Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7268503Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7269694Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7270886Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7272070Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7273219Z triton_flex_attention_backward_262 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7274379Z triton_flex_attention_backward_265 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.7274664Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.3139 seconds precompiling for 13 choices 2025-12-04T09:15:00.7274796Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.7274865Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.7274934Z unimplemented [] 2025-12-04T09:15:00.7275034Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.7275216Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.7276402Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.7276462Z graph_break [] 2025-12-04T09:15:00.7276594Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.7276660Z Autotune Choices Stats: 2025-12-04T09:15:00.7278083Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.7278357Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7278579Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7278902Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7280079Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7281202Z triton_flex_attention_271 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7282322Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7283474Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.7284593Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.7285707Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7285956Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T09:15:00.7286078Z Autotune Choices Stats: 2025-12-04T09:15:00.7287540Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:15:00.7287974Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7288338Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7288902Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7290069Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7291259Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7292505Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7293664Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7294858Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7296049Z triton_flex_attention_backward_278 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7297233Z triton_flex_attention_backward_279 0.0136 ms 90.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7298387Z triton_flex_attention_backward_276 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7299542Z triton_flex_attention_backward_281 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7300741Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.7300992Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.2940 seconds precompiling for 13 choices 2025-12-04T09:15:00.7301125Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.7301206Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.7301279Z unimplemented [] 2025-12-04T09:15:00.7301383Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.7301566Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.7302747Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.7302807Z graph_break [] 2025-12-04T09:15:00.7302978Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.7303049Z Autotune Choices Stats: 2025-12-04T09:15:00.7304470Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.7304715Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7304931Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7305298Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7306424Z triton_flex_attention_289 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7307547Z triton_flex_attention_290 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7308709Z triton_flex_attention_287 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.7309819Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7310943Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.7312086Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7312368Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3003 seconds precompiling for 6 choices 2025-12-04T09:15:00.7312446Z Autotune Choices Stats: 2025-12-04T09:15:00.7313878Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_294", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.012160000391304493, "best_triton_pos": 0} 2025-12-04T09:15:00.7314348Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7314681Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7315241Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7316451Z triton_flex_attention_backward_294 0.0122 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7317662Z triton_flex_attention_backward_291 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7318816Z triton_flex_attention_backward_292 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7319976Z triton_flex_attention_backward_293 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7321192Z triton_flex_attention_backward_296 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7322357Z triton_flex_attention_backward_295 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7323548Z triton_flex_attention_backward_297 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7324712Z triton_flex_attention_backward_298 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7325895Z triton_flex_attention_backward_300 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7327049Z triton_flex_attention_backward_303 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.7327295Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.2523 seconds precompiling for 13 choices 2025-12-04T09:15:00.7327426Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.7327493Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.7327561Z unimplemented [] 2025-12-04T09:15:00.7327664Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.7327846Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.7329056Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.7329151Z graph_break [] 2025-12-04T09:15:00.7329280Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.7329346Z Autotune Choices Stats: 2025-12-04T09:15:00.7330733Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_308", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.7331010Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7331227Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7331564Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7332746Z triton_flex_attention_308 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7333901Z triton_flex_attention_309 0.0082 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7335020Z triton_flex_attention_307 0.0112 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.7336135Z triton_flex_attention_304 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7337366Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.7338512Z triton_flex_attention_305 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7338763Z SingleProcess AUTOTUNE benchmarking takes 0.2924 seconds and 1.4096 seconds precompiling for 6 choices 2025-12-04T09:15:00.7338830Z Autotune Choices Stats: 2025-12-04T09:15:00.7340286Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.7340722Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7341051Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7341649Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7342817Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7343970Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7345130Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7346333Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7347527Z triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7348734Z triton_flex_attention_backward_316 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7349883Z triton_flex_attention_backward_317 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7351084Z triton_flex_attention_backward_314 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7352227Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7353386Z triton_flex_attention_backward_318 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7353631Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.3225 seconds precompiling for 13 choices 2025-12-04T09:15:00.7353771Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.7353844Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.7353916Z unimplemented [] 2025-12-04T09:15:00.7354056Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.7354275Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.7355460Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.7355521Z graph_break [] 2025-12-04T09:15:00.7355655Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.7355723Z Autotune Choices Stats: 2025-12-04T09:15:00.7357143Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.007968000136315823, "best_triton_pos": 0} 2025-12-04T09:15:00.7357398Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7357616Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7357940Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7359107Z triton_flex_attention_327 0.0080 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7360229Z triton_flex_attention_328 0.0082 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7361385Z triton_flex_attention_325 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.7362657Z triton_flex_attention_326 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.7363893Z triton_flex_attention_323 0.0113 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7365005Z triton_flex_attention_324 0.0133 ms 59.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7365278Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T09:15:00.7365391Z Autotune Choices Stats: 2025-12-04T09:15:00.7366825Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.7367257Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7367653Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7368219Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7369391Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7370555Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7371750Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7373083Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7374281Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7375445Z triton_flex_attention_backward_336 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7376598Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7377783Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7378937Z triton_flex_attention_backward_338 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7380100Z triton_flex_attention_backward_341 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.7380381Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3668 seconds precompiling for 13 choices 2025-12-04T09:15:00.7380567Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.7380639Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.7380709Z unimplemented [] 2025-12-04T09:15:00.7380810Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.7380994Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.7382173Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.7382236Z graph_break [] 2025-12-04T09:15:00.7382372Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.7382475Z Autotune Choices Stats: 2025-12-04T09:15:00.7383860Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.7384106Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7384357Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7384678Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7385806Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7386927Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7388041Z triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7389191Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.7390349Z triton_flex_attention_345 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.7391491Z triton_flex_attention_343 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7391743Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2820 seconds precompiling for 6 choices 2025-12-04T09:15:00.7391809Z Autotune Choices Stats: 2025-12-04T09:15:00.7393235Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_349", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.7393701Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7394031Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7394608Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7395787Z triton_flex_attention_backward_349 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7396939Z triton_flex_attention_backward_348 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7398163Z triton_flex_attention_backward_350 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7399314Z triton_flex_attention_backward_351 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7400505Z triton_flex_attention_backward_353 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7401662Z triton_flex_attention_backward_355 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7402843Z triton_flex_attention_backward_352 0.0143 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7403996Z triton_flex_attention_backward_354 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7405143Z triton_flex_attention_backward_357 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7406340Z triton_flex_attention_backward_360 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.7406616Z SingleProcess AUTOTUNE benchmarking takes 0.6637 seconds and 2.2959 seconds precompiling for 13 choices 2025-12-04T09:15:00.7406752Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.7406821Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.7406883Z unimplemented [] 2025-12-04T09:15:00.7406990Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.7407188Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.7408404Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.7408468Z graph_break [] 2025-12-04T09:15:00.7408602Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.7408669Z Autotune Choices Stats: 2025-12-04T09:15:00.7410066Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.7410352Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7410581Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7410908Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7412037Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7413212Z triton_flex_attention_366 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7414378Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.7415521Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7416672Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.7417786Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7418035Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3270 seconds precompiling for 6 choices 2025-12-04T09:15:00.7418102Z Autotune Choices Stats: 2025-12-04T09:15:00.7419567Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_368", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.7419994Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7420328Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7425772Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7426970Z triton_flex_attention_backward_368 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7428214Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7429418Z triton_flex_attention_backward_369 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7430612Z triton_flex_attention_backward_370 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7431788Z triton_flex_attention_backward_372 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7432965Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7434129Z triton_flex_attention_backward_374 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7435299Z triton_flex_attention_backward_371 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7436481Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7437666Z triton_flex_attention_backward_379 0.0146 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.7437922Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.4637 seconds precompiling for 13 choices 2025-12-04T09:15:00.7438075Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.7438152Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.7438218Z unimplemented [] 2025-12-04T09:15:00.7438335Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.7438557Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.7439751Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.7439814Z graph_break [] 2025-12-04T09:15:00.7439948Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.7440045Z Autotune Choices Stats: 2025-12-04T09:15:00.7441487Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_384", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.7441740Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7441972Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7442297Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7443424Z triton_flex_attention_384 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7444538Z triton_flex_attention_385 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7445714Z triton_flex_attention_380 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7446828Z triton_flex_attention_382 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.7447973Z triton_flex_attention_383 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.7449086Z triton_flex_attention_381 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7449394Z SingleProcess AUTOTUNE benchmarking takes 0.2918 seconds and 1.3242 seconds precompiling for 6 choices 2025-12-04T09:15:00.7449460Z Autotune Choices Stats: 2025-12-04T09:15:00.7450881Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_386", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:15:00.7451324Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7451649Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7452220Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7453500Z triton_flex_attention_backward_386 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7454686Z triton_flex_attention_backward_387 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7455868Z triton_flex_attention_backward_388 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7457012Z triton_flex_attention_backward_389 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7458164Z triton_flex_attention_backward_391 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7459338Z triton_flex_attention_backward_392 0.0134 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7460488Z triton_flex_attention_backward_390 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7461638Z triton_flex_attention_backward_393 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7463082Z triton_flex_attention_backward_395 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7464286Z triton_flex_attention_backward_394 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7464593Z SingleProcess AUTOTUNE benchmarking takes 0.6645 seconds and 2.2524 seconds precompiling for 13 choices 2025-12-04T09:15:00.7464737Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.7464810Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.7464874Z unimplemented [] 2025-12-04T09:15:00.7464994Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.7465183Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.7466358Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.7466472Z graph_break [] 2025-12-04T09:15:00.7466606Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.7466680Z Autotune Choices Stats: 2025-12-04T09:15:00.7468078Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_403", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.7468332Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7468551Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7468877Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7470010Z triton_flex_attention_403 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7471222Z triton_flex_attention_404 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7472337Z triton_flex_attention_399 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7473493Z triton_flex_attention_401 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.7474606Z triton_flex_attention_402 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.7475799Z triton_flex_attention_400 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7476051Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3374 seconds precompiling for 6 choices 2025-12-04T09:15:00.7476117Z Autotune Choices Stats: 2025-12-04T09:15:00.7477551Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_406", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.7477992Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7478320Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7478929Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7480129Z triton_flex_attention_backward_406 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7481339Z triton_flex_attention_backward_405 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7482494Z triton_flex_attention_backward_407 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7483639Z triton_flex_attention_backward_408 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7484835Z triton_flex_attention_backward_410 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7485986Z triton_flex_attention_backward_409 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7487141Z triton_flex_attention_backward_411 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7488323Z triton_flex_attention_backward_412 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7489508Z triton_flex_attention_backward_413 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7490689Z triton_flex_attention_backward_414 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7490936Z SingleProcess AUTOTUNE benchmarking takes 0.6644 seconds and 2.4031 seconds precompiling for 13 choices 2025-12-04T09:15:00.7491073Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.7491143Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.7491205Z unimplemented [] 2025-12-04T09:15:00.7491313Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.7491498Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.7492795Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.7492859Z graph_break [] 2025-12-04T09:15:00.7492986Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.7493061Z Autotune Choices Stats: 2025-12-04T09:15:00.7494453Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_422", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.7494705Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7494922Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7495242Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7496407Z triton_flex_attention_422 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7497571Z triton_flex_attention_423 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7498714Z triton_flex_attention_420 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.7499832Z triton_flex_attention_421 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.7500936Z triton_flex_attention_418 0.0123 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7502079Z triton_flex_attention_419 0.0142 ms 57.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7502333Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3073 seconds precompiling for 6 choices 2025-12-04T09:15:00.7502402Z Autotune Choices Stats: 2025-12-04T09:15:00.7503829Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_426", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012256000190973282, "best_triton_pos": 0} 2025-12-04T09:15:00.7504268Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7504660Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7505235Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7506398Z triton_flex_attention_backward_426 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7507603Z triton_flex_attention_backward_424 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7508758Z triton_flex_attention_backward_425 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7509934Z triton_flex_attention_backward_427 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7511085Z triton_flex_attention_backward_429 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7512234Z triton_flex_attention_backward_431 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7513413Z triton_flex_attention_backward_430 0.0141 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7514612Z triton_flex_attention_backward_428 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7515789Z triton_flex_attention_backward_433 0.0143 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7516947Z triton_flex_attention_backward_436 0.0144 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.7517189Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.3392 seconds precompiling for 13 choices 2025-12-04T09:15:00.7517358Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.7517427Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.7517490Z unimplemented [] 2025-12-04T09:15:00.7517600Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.7517784Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.7518963Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.7519023Z graph_break [] 2025-12-04T09:15:00.7519155Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.7519227Z Autotune Choices Stats: 2025-12-04T09:15:00.7520631Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_441", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.7520879Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7521097Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7521491Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7522618Z triton_flex_attention_441 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7523761Z triton_flex_attention_442 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7524876Z triton_flex_attention_439 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.7525989Z triton_flex_attention_437 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7527129Z triton_flex_attention_440 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.7528243Z triton_flex_attention_438 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7528490Z SingleProcess AUTOTUNE benchmarking takes 0.2906 seconds and 1.4377 seconds precompiling for 6 choices 2025-12-04T09:15:00.7528565Z Autotune Choices Stats: 2025-12-04T09:15:00.7530034Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_445", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.7530505Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7530829Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7531399Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7532669Z triton_flex_attention_backward_445 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7533843Z triton_flex_attention_backward_446 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7535037Z triton_flex_attention_backward_443 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7536183Z triton_flex_attention_backward_444 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7537335Z triton_flex_attention_backward_447 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7538479Z triton_flex_attention_backward_448 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7539693Z triton_flex_attention_backward_450 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7540842Z triton_flex_attention_backward_449 0.0135 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7542016Z triton_flex_attention_backward_452 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7543171Z triton_flex_attention_backward_455 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.7543454Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.3278 seconds precompiling for 13 choices 2025-12-04T09:15:00.7543593Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.7543662Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.7543725Z unimplemented [] 2025-12-04T09:15:00.7543833Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.7544017Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.7545208Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.7545274Z graph_break [] 2025-12-04T09:15:00.7545402Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.7545472Z Autotune Choices Stats: 2025-12-04T09:15:00.7546917Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_460", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:15:00.7547200Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7547416Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7547741Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7548867Z triton_flex_attention_460 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7550031Z triton_flex_attention_461 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7551147Z triton_flex_attention_459 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.7552295Z triton_flex_attention_456 0.0114 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7553401Z triton_flex_attention_458 0.0123 ms 75.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.7554529Z triton_flex_attention_457 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7554777Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.4513 seconds precompiling for 6 choices 2025-12-04T09:15:00.7554865Z Autotune Choices Stats: 2025-12-04T09:15:00.7556330Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_463", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.7556803Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7557126Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7557728Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7558894Z triton_flex_attention_backward_463 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7560063Z triton_flex_attention_backward_465 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7561251Z triton_flex_attention_backward_462 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7562541Z triton_flex_attention_backward_464 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7563700Z triton_flex_attention_backward_467 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7564955Z triton_flex_attention_backward_468 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7566126Z triton_flex_attention_backward_469 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7567323Z triton_flex_attention_backward_466 0.0133 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7568478Z triton_flex_attention_backward_471 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7569688Z triton_flex_attention_backward_470 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7569931Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.2662 seconds precompiling for 13 choices 2025-12-04T09:15:00.7570065Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.7570133Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.7570194Z unimplemented [] 2025-12-04T09:15:00.7570305Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.7570488Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.7571668Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.7571738Z graph_break [] 2025-12-04T09:15:00.7571866Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.7571938Z Autotune Choices Stats: 2025-12-04T09:15:00.7573422Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_479", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.007968000136315823, "best_triton_pos": 0} 2025-12-04T09:15:00.7573708Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7573922Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7574250Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7575416Z triton_flex_attention_479 0.0080 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7576528Z triton_flex_attention_480 0.0082 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7577695Z triton_flex_attention_477 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.7578810Z triton_flex_attention_478 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.7579920Z triton_flex_attention_475 0.0113 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7581034Z triton_flex_attention_476 0.0133 ms 59.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7581342Z SingleProcess AUTOTUNE benchmarking takes 0.2917 seconds and 1.3266 seconds precompiling for 6 choices 2025-12-04T09:15:00.7581415Z Autotune Choices Stats: 2025-12-04T09:15:00.7582832Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_481", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.7583269Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7583632Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7584195Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7585365Z triton_flex_attention_backward_481 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7586563Z triton_flex_attention_backward_482 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7587714Z triton_flex_attention_backward_483 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7588868Z triton_flex_attention_backward_484 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7590069Z triton_flex_attention_backward_485 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7591249Z triton_flex_attention_backward_486 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7592439Z triton_flex_attention_backward_487 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7593586Z triton_flex_attention_backward_488 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7594742Z triton_flex_attention_backward_490 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7595926Z triton_flex_attention_backward_493 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.7596176Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2475 seconds precompiling for 13 choices 2025-12-04T09:15:00.7596312Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.7596381Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.7596443Z unimplemented [] 2025-12-04T09:15:00.7596553Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.7596736Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.7597913Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.7598012Z graph_break [] 2025-12-04T09:15:00.7598170Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.7598245Z Autotune Choices Stats: 2025-12-04T09:15:00.7599633Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_498", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.7599878Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7600129Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7600449Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7601581Z triton_flex_attention_498 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7602698Z triton_flex_attention_499 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7603838Z triton_flex_attention_494 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7604965Z triton_flex_attention_496 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.7606082Z triton_flex_attention_497 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.7607270Z triton_flex_attention_495 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7607515Z SingleProcess AUTOTUNE benchmarking takes 0.2916 seconds and 1.3152 seconds precompiling for 6 choices 2025-12-04T09:15:00.7607588Z Autotune Choices Stats: 2025-12-04T09:15:00.7609118Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_500", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:15:00.7609566Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7609891Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7610459Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7611660Z triton_flex_attention_backward_500 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7612867Z triton_flex_attention_backward_501 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7614018Z triton_flex_attention_backward_502 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7615215Z triton_flex_attention_backward_503 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7616400Z triton_flex_attention_backward_505 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7617578Z triton_flex_attention_backward_507 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7618733Z triton_flex_attention_backward_506 0.0143 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7619881Z triton_flex_attention_backward_504 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7621066Z triton_flex_attention_backward_509 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7622222Z triton_flex_attention_backward_512 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.7622472Z SingleProcess AUTOTUNE benchmarking takes 0.6641 seconds and 2.2733 seconds precompiling for 13 choices 2025-12-04T09:15:00.7622600Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.7622676Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.7622737Z unimplemented [] 2025-12-04T09:15:00.7622844Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.7623032Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.7624272Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 27), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 8), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 5), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.7624341Z graph_break [] 2025-12-04T09:15:00.7624468Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.7624541Z Autotune Choices Stats: 2025-12-04T09:15:00.7625962Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_517", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.7626212Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7626426Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7626752Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7627896Z triton_flex_attention_517 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7629053Z triton_flex_attention_518 0.0084 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7630160Z triton_flex_attention_513 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7631283Z triton_flex_attention_515 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.7632425Z triton_flex_attention_516 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.7633570Z triton_flex_attention_514 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7633811Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3311 seconds precompiling for 6 choices 2025-12-04T09:15:00.7633887Z Autotune Choices Stats: 2025-12-04T09:15:00.7635339Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_519", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:15:00.7635777Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7636102Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7636700Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7637856Z triton_flex_attention_backward_519 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7639019Z triton_flex_attention_backward_520 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7640173Z triton_flex_attention_backward_521 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7641417Z triton_flex_attention_backward_522 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7642575Z triton_flex_attention_backward_524 0.0132 ms 93.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7643992Z triton_flex_attention_backward_523 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7645158Z triton_flex_attention_backward_525 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7646341Z triton_flex_attention_backward_526 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7647494Z triton_flex_attention_backward_528 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7648652Z triton_flex_attention_backward_531 0.0153 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.7648896Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.2909 seconds precompiling for 13 choices 2025-12-04T09:15:00.7649024Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.7649135Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.7649197Z unimplemented [] 2025-12-04T09:15:00.7649335Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.7649520Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.7650691Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.7650758Z graph_break [] 2025-12-04T09:15:00.7650884Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.7650960Z Autotune Choices Stats: 2025-12-04T09:15:00.7652447Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_536", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.7652700Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7652917Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7653268Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7654402Z triton_flex_attention_536 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7655519Z triton_flex_attention_537 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7656631Z triton_flex_attention_532 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7657776Z triton_flex_attention_534 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.7658913Z triton_flex_attention_535 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.7660105Z triton_flex_attention_533 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7660356Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.2928 seconds precompiling for 6 choices 2025-12-04T09:15:00.7660429Z Autotune Choices Stats: 2025-12-04T09:15:00.7662007Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_538", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:15:00.7662538Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7662868Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7663431Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7664609Z triton_flex_attention_backward_538 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7665768Z triton_flex_attention_backward_539 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7666964Z triton_flex_attention_backward_540 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7668158Z triton_flex_attention_backward_541 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7669356Z triton_flex_attention_backward_542 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7670500Z triton_flex_attention_backward_543 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7671688Z triton_flex_attention_backward_545 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7672835Z triton_flex_attention_backward_544 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7673999Z triton_flex_attention_backward_547 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7675199Z triton_flex_attention_backward_550 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.7675478Z SingleProcess AUTOTUNE benchmarking takes 0.6681 seconds and 2.3456 seconds precompiling for 13 choices 2025-12-04T09:15:00.7675621Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.7675696Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.7675762Z unimplemented [] 2025-12-04T09:15:00.7675872Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.7676057Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.7677264Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.7677337Z graph_break [] 2025-12-04T09:15:00.7677467Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.7677540Z Autotune Choices Stats: 2025-12-04T09:15:00.7678923Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_555", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.7679208Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7679426Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7679756Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7680895Z triton_flex_attention_555 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7682009Z triton_flex_attention_556 0.0112 ms 91.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7683120Z triton_flex_attention_553 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.7684304Z triton_flex_attention_554 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.7685411Z triton_flex_attention_551 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7686597Z triton_flex_attention_552 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7686842Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.3230 seconds precompiling for 6 choices 2025-12-04T09:15:00.7686915Z Autotune Choices Stats: 2025-12-04T09:15:00.7688341Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_559", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.7688816Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7689140Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7689712Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7690874Z triton_flex_attention_backward_559 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7692067Z triton_flex_attention_backward_560 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7693322Z triton_flex_attention_backward_558 0.0142 ms 93.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7694506Z triton_flex_attention_backward_557 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7695658Z triton_flex_attention_backward_562 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7696805Z triton_flex_attention_backward_561 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7697988Z triton_flex_attention_backward_563 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7699131Z triton_flex_attention_backward_564 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7700284Z triton_flex_attention_backward_569 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.7701492Z triton_flex_attention_backward_566 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7701734Z SingleProcess AUTOTUNE benchmarking takes 0.6659 seconds and 2.2862 seconds precompiling for 13 choices 2025-12-04T09:15:00.7701860Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.7701932Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.7701992Z unimplemented [] 2025-12-04T09:15:00.7702092Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.7702282Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.7703489Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.7703554Z graph_break [] 2025-12-04T09:15:00.7703679Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.7703748Z Autotune Choices Stats: 2025-12-04T09:15:00.7705143Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_575", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.7705441Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7705659Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7705973Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7707109Z triton_flex_attention_575 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7708220Z triton_flex_attention_574 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7709375Z triton_flex_attention_572 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.7710528Z triton_flex_attention_573 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.7711674Z triton_flex_attention_570 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7712791Z triton_flex_attention_571 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7713065Z SingleProcess AUTOTUNE benchmarking takes 0.2930 seconds and 1.2848 seconds precompiling for 6 choices 2025-12-04T09:15:00.7713135Z Autotune Choices Stats: 2025-12-04T09:15:00.7714563Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_579", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.014271999709308147, "best_triton_pos": 0} 2025-12-04T09:15:00.7714996Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7715325Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7715894Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7717051Z triton_flex_attention_backward_579 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7718274Z triton_flex_attention_backward_577 0.0143 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7719434Z triton_flex_attention_backward_578 0.0143 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7720617Z triton_flex_attention_backward_576 0.0145 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7721768Z triton_flex_attention_backward_580 0.0154 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7722941Z triton_flex_attention_backward_581 0.0154 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7724109Z triton_flex_attention_backward_583 0.0155 ms 92.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7725261Z triton_flex_attention_backward_582 0.0164 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7726447Z triton_flex_attention_backward_585 0.0164 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7727642Z triton_flex_attention_backward_588 0.0164 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.7727888Z SingleProcess AUTOTUNE benchmarking takes 0.6683 seconds and 2.2637 seconds precompiling for 13 choices 2025-12-04T09:15:00.7728020Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.7728129Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.7728192Z unimplemented [] 2025-12-04T09:15:00.7728292Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.7728481Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.7729649Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.7729751Z graph_break [] 2025-12-04T09:15:00.7729880Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.7729953Z Autotune Choices Stats: 2025-12-04T09:15:00.7731338Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.7731584Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7731800Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7732116Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7733307Z triton_flex_attention_593 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7734455Z triton_flex_attention_594 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7735605Z triton_flex_attention_589 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7736772Z triton_flex_attention_591 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.7737886Z triton_flex_attention_592 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.7738998Z triton_flex_attention_590 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7739272Z SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.3283 seconds precompiling for 6 choices 2025-12-04T09:15:00.7739343Z Autotune Choices Stats: 2025-12-04T09:15:00.7740761Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_595", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:15:00.7741201Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7741525Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7742092Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7743375Z triton_flex_attention_backward_595 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7744535Z triton_flex_attention_backward_596 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7745730Z triton_flex_attention_backward_597 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7746893Z triton_flex_attention_backward_598 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7748090Z triton_flex_attention_backward_600 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7749237Z triton_flex_attention_backward_604 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7750400Z triton_flex_attention_backward_599 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7751580Z triton_flex_attention_backward_601 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7752772Z triton_flex_attention_backward_602 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7753961Z triton_flex_attention_backward_607 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.7754219Z SingleProcess AUTOTUNE benchmarking takes 0.6683 seconds and 2.3815 seconds precompiling for 13 choices 2025-12-04T09:15:00.7754348Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.7754423Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.7754484Z unimplemented [] 2025-12-04T09:15:00.7754584Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.7754775Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.7755945Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.7756052Z graph_break [] 2025-12-04T09:15:00.7756177Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.7756243Z Autotune Choices Stats: 2025-12-04T09:15:00.7757635Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_612", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.7757882Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7758098Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7758409Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7759574Z triton_flex_attention_612 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7760715Z triton_flex_attention_613 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7762075Z triton_flex_attention_610 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.7763235Z triton_flex_attention_608 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7764354Z triton_flex_attention_611 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.7765513Z triton_flex_attention_609 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7765752Z SingleProcess AUTOTUNE benchmarking takes 0.2948 seconds and 1.3336 seconds precompiling for 6 choices 2025-12-04T09:15:00.7765823Z Autotune Choices Stats: 2025-12-04T09:15:00.7767252Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_614", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:15:00.7767690Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7768016Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7768694Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7769858Z triton_flex_attention_backward_614 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7771058Z triton_flex_attention_backward_615 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7772209Z triton_flex_attention_backward_616 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7773465Z triton_flex_attention_backward_617 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7774652Z triton_flex_attention_backward_619 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7775811Z triton_flex_attention_backward_620 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7776967Z triton_flex_attention_backward_621 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7778184Z triton_flex_attention_backward_618 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7779338Z triton_flex_attention_backward_623 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7780526Z triton_flex_attention_backward_626 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.7780776Z SingleProcess AUTOTUNE benchmarking takes 0.6671 seconds and 2.3094 seconds precompiling for 13 choices 2025-12-04T09:15:00.7780906Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.7780985Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.7781081Z unimplemented [] 2025-12-04T09:15:00.7781185Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.7781386Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.7782561Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.7782629Z graph_break [] 2025-12-04T09:15:00.7782763Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.7782829Z Autotune Choices Stats: 2025-12-04T09:15:00.7784228Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_631", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.7784478Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7784704Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7785024Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7786230Z triton_flex_attention_631 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7787344Z triton_flex_attention_632 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7788498Z triton_flex_attention_627 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7789622Z triton_flex_attention_629 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.7790780Z triton_flex_attention_630 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.7791895Z triton_flex_attention_628 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7792143Z SingleProcess AUTOTUNE benchmarking takes 0.2939 seconds and 1.3035 seconds precompiling for 6 choices 2025-12-04T09:15:00.7792218Z Autotune Choices Stats: 2025-12-04T09:15:00.7793645Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_634", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.7794151Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7794484Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7795055Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7796256Z triton_flex_attention_backward_634 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7797415Z triton_flex_attention_backward_635 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7798565Z triton_flex_attention_backward_636 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7799763Z triton_flex_attention_backward_633 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7800927Z triton_flex_attention_backward_638 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7802078Z triton_flex_attention_backward_640 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7803280Z triton_flex_attention_backward_637 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7804462Z triton_flex_attention_backward_639 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7805654Z triton_flex_attention_backward_642 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7806799Z triton_flex_attention_backward_645 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.7807092Z SingleProcess AUTOTUNE benchmarking takes 0.6697 seconds and 2.2461 seconds precompiling for 13 choices 2025-12-04T09:15:00.7807226Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.7807303Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.7807367Z unimplemented [] 2025-12-04T09:15:00.7807472Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.7807665Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.7808835Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.7808910Z graph_break [] 2025-12-04T09:15:00.7809039Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.7809111Z Autotune Choices Stats: 2025-12-04T09:15:00.7810510Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_650", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.7810790Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7811056Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7811373Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7812564Z triton_flex_attention_650 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7813711Z triton_flex_attention_651 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7814834Z triton_flex_attention_648 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.7815984Z triton_flex_attention_646 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7817093Z triton_flex_attention_649 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.7818215Z triton_flex_attention_647 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7818457Z SingleProcess AUTOTUNE benchmarking takes 0.2938 seconds and 1.3188 seconds precompiling for 6 choices 2025-12-04T09:15:00.7818529Z Autotune Choices Stats: 2025-12-04T09:15:00.7819986Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_654", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.7820465Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7820799Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7821372Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7822580Z triton_flex_attention_backward_654 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7823739Z triton_flex_attention_backward_652 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7824919Z triton_flex_attention_backward_653 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7826074Z triton_flex_attention_backward_655 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7827223Z triton_flex_attention_backward_657 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7828412Z triton_flex_attention_backward_659 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7829599Z triton_flex_attention_backward_658 0.0156 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7830778Z triton_flex_attention_backward_656 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7831929Z triton_flex_attention_backward_661 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7833079Z triton_flex_attention_backward_664 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.7833380Z SingleProcess AUTOTUNE benchmarking takes 0.6665 seconds and 2.3325 seconds precompiling for 13 choices 2025-12-04T09:15:00.7833507Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.7833580Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.7833641Z unimplemented [] 2025-12-04T09:15:00.7833741Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.7833931Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.7835119Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.7835185Z graph_break [] 2025-12-04T09:15:00.7835311Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.7835376Z Autotune Choices Stats: 2025-12-04T09:15:00.7836812Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_669", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.7837090Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7837309Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7837626Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7838796Z triton_flex_attention_669 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7839910Z triton_flex_attention_670 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7841027Z triton_flex_attention_667 0.0123 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.7842163Z triton_flex_attention_665 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7843284Z triton_flex_attention_668 0.0135 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.7844405Z triton_flex_attention_666 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7844649Z SingleProcess AUTOTUNE benchmarking takes 0.2949 seconds and 1.4095 seconds precompiling for 6 choices 2025-12-04T09:15:00.7844761Z Autotune Choices Stats: 2025-12-04T09:15:00.7846214Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_673", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012512000277638435, "best_triton_pos": 0} 2025-12-04T09:15:00.7846651Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7847006Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7847568Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7848746Z triton_flex_attention_backward_673 0.0125 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7849935Z triton_flex_attention_backward_672 0.0133 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7851079Z triton_flex_attention_backward_674 0.0133 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7852234Z triton_flex_attention_backward_671 0.0143 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7853602Z triton_flex_attention_backward_676 0.0143 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7854789Z triton_flex_attention_backward_675 0.0152 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7855983Z triton_flex_attention_backward_677 0.0154 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7857130Z triton_flex_attention_backward_678 0.0154 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7858286Z triton_flex_attention_backward_680 0.0164 ms 76.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7859463Z triton_flex_attention_backward_683 0.0164 ms 76.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.7859716Z SingleProcess AUTOTUNE benchmarking takes 0.6665 seconds and 2.3098 seconds precompiling for 13 choices 2025-12-04T09:15:00.7859844Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.7859919Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.7859984Z unimplemented [] 2025-12-04T09:15:00.7860086Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.7860276Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.7861441Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.7861507Z graph_break [] 2025-12-04T09:15:00.7861632Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.7861736Z Autotune Choices Stats: 2025-12-04T09:15:00.7863355Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_688", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.7863605Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7863832Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7864157Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7865368Z triton_flex_attention_688 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7866493Z triton_flex_attention_689 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7867660Z triton_flex_attention_686 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.7868771Z triton_flex_attention_684 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7869900Z triton_flex_attention_687 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.7871049Z triton_flex_attention_685 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7871344Z SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.3382 seconds precompiling for 6 choices 2025-12-04T09:15:00.7871410Z Autotune Choices Stats: 2025-12-04T09:15:00.7872840Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_693", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.7873317Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7873642Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7874203Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7875377Z triton_flex_attention_backward_693 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7876568Z triton_flex_attention_backward_692 0.0135 ms 98.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7877715Z triton_flex_attention_backward_691 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7878879Z triton_flex_attention_backward_695 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7880152Z triton_flex_attention_backward_690 0.0153 ms 86.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7881317Z triton_flex_attention_backward_696 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7882500Z triton_flex_attention_backward_697 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7883642Z triton_flex_attention_backward_699 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7884831Z triton_flex_attention_backward_694 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7885979Z triton_flex_attention_backward_702 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.7886230Z SingleProcess AUTOTUNE benchmarking takes 0.6670 seconds and 2.3646 seconds precompiling for 13 choices 2025-12-04T09:15:00.7886358Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.7886430Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.7886490Z unimplemented [] 2025-12-04T09:15:00.7886589Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.7886776Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.7887986Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.7888082Z graph_break [] 2025-12-04T09:15:00.7888207Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.7888272Z Autotune Choices Stats: 2025-12-04T09:15:00.7889670Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_707", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.7889946Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7890180Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7890496Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7891625Z triton_flex_attention_707 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7892841Z triton_flex_attention_708 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7893955Z triton_flex_attention_703 0.0134 ms 76.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7895073Z triton_flex_attention_705 0.0135 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.7896184Z triton_flex_attention_706 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.7897392Z triton_flex_attention_704 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7897634Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3640 seconds precompiling for 6 choices 2025-12-04T09:15:00.7897700Z Autotune Choices Stats: 2025-12-04T09:15:00.7899161Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_712", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.7899608Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7899929Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7900525Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7901691Z triton_flex_attention_backward_712 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7902842Z triton_flex_attention_backward_710 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7903982Z triton_flex_attention_backward_711 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7905167Z triton_flex_attention_backward_709 0.0153 ms 86.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7906341Z triton_flex_attention_backward_714 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7907519Z triton_flex_attention_backward_716 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7908671Z triton_flex_attention_backward_713 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7909855Z triton_flex_attention_backward_715 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7911001Z triton_flex_attention_backward_718 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7912149Z triton_flex_attention_backward_721 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.7912414Z SingleProcess AUTOTUNE benchmarking takes 0.6684 seconds and 2.3297 seconds precompiling for 13 choices 2025-12-04T09:15:00.7912545Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.7912618Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.7912679Z unimplemented [] 2025-12-04T09:15:00.7912778Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.7913005Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.7914207Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.7914276Z graph_break [] 2025-12-04T09:15:00.7914402Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.7914472Z Autotune Choices Stats: 2025-12-04T09:15:00.7915894Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_726", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.7916138Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7916361Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7916674Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7917812Z triton_flex_attention_726 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7918956Z triton_flex_attention_727 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7920076Z triton_flex_attention_724 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.7921181Z triton_flex_attention_722 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7922372Z triton_flex_attention_725 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.7923493Z triton_flex_attention_723 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7923741Z SingleProcess AUTOTUNE benchmarking takes 0.2946 seconds and 1.3261 seconds precompiling for 6 choices 2025-12-04T09:15:00.7923842Z Autotune Choices Stats: 2025-12-04T09:15:00.7925287Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_730", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.7925724Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7926088Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7926650Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7927821Z triton_flex_attention_backward_730 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7928983Z triton_flex_attention_backward_729 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7930189Z triton_flex_attention_backward_731 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7931369Z triton_flex_attention_backward_728 0.0152 ms 87.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7932597Z triton_flex_attention_backward_733 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7933767Z triton_flex_attention_backward_732 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7934942Z triton_flex_attention_backward_734 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7936122Z triton_flex_attention_backward_735 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7937273Z triton_flex_attention_backward_737 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7938415Z triton_flex_attention_backward_736 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7938696Z SingleProcess AUTOTUNE benchmarking takes 0.6686 seconds and 2.3200 seconds precompiling for 13 choices 2025-12-04T09:15:00.7938858Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.7938934Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.7938996Z unimplemented [] 2025-12-04T09:15:00.7939096Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.7939283Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.7940450Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.7940529Z graph_break [] 2025-12-04T09:15:00.7940658Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.7940757Z Autotune Choices Stats: 2025-12-04T09:15:00.7942148Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_745", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.7942388Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7942645Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7942963Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7944096Z triton_flex_attention_745 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7945217Z triton_flex_attention_746 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7946337Z triton_flex_attention_743 0.0142 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.7947474Z triton_flex_attention_741 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7948617Z triton_flex_attention_744 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.7949763Z triton_flex_attention_742 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7950006Z SingleProcess AUTOTUNE benchmarking takes 0.2945 seconds and 1.2852 seconds precompiling for 6 choices 2025-12-04T09:15:00.7950070Z Autotune Choices Stats: 2025-12-04T09:15:00.7951495Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_748", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.7951965Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7952286Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7952844Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7954019Z triton_flex_attention_backward_748 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7955170Z triton_flex_attention_backward_749 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7956380Z triton_flex_attention_backward_747 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7957542Z triton_flex_attention_backward_750 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7958723Z triton_flex_attention_backward_752 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7959870Z triton_flex_attention_backward_751 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7961075Z triton_flex_attention_backward_753 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7962358Z triton_flex_attention_backward_754 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7963517Z triton_flex_attention_backward_756 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7964727Z triton_flex_attention_backward_759 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.7965031Z SingleProcess AUTOTUNE benchmarking takes 0.6747 seconds and 2.3463 seconds precompiling for 13 choices 2025-12-04T09:15:00.7965163Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.7965235Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.7965296Z unimplemented [] 2025-12-04T09:15:00.7965399Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.7965587Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.7966802Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.7966872Z graph_break [] 2025-12-04T09:15:00.7966998Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.7967073Z Autotune Choices Stats: 2025-12-04T09:15:00.7968465Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_764", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.7968765Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7968987Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7969307Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7970445Z triton_flex_attention_764 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7971563Z triton_flex_attention_765 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7972791Z triton_flex_attention_762 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.7973934Z triton_flex_attention_760 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7975083Z triton_flex_attention_763 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.7976205Z triton_flex_attention_761 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7976449Z SingleProcess AUTOTUNE benchmarking takes 0.2940 seconds and 1.3826 seconds precompiling for 6 choices 2025-12-04T09:15:00.7976514Z Autotune Choices Stats: 2025-12-04T09:15:00.7977942Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_769", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013376000337302685, "best_triton_pos": 0} 2025-12-04T09:15:00.7978407Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7978736Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7979300Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7980477Z triton_flex_attention_backward_769 0.0134 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7981669Z triton_flex_attention_backward_767 0.0143 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7982872Z triton_flex_attention_backward_768 0.0143 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7984053Z triton_flex_attention_backward_766 0.0154 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7985195Z triton_flex_attention_backward_770 0.0154 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.7986386Z triton_flex_attention_backward_771 0.0154 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7987529Z triton_flex_attention_backward_772 0.0154 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.7988688Z triton_flex_attention_backward_773 0.0154 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.7989868Z triton_flex_attention_backward_775 0.0164 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7991044Z triton_flex_attention_backward_778 0.0164 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.7991296Z SingleProcess AUTOTUNE benchmarking takes 0.6673 seconds and 2.4135 seconds precompiling for 13 choices 2025-12-04T09:15:00.7991423Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.7991498Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.7991561Z unimplemented [] 2025-12-04T09:15:00.7991663Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.7991907Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.7993094Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.7993161Z graph_break [] 2025-12-04T09:15:00.7993290Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.7993353Z Autotune Choices Stats: 2025-12-04T09:15:00.7994778Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_783", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.7995020Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.7995244Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.7995560Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.7996695Z triton_flex_attention_783 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7997802Z triton_flex_attention_784 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.7998978Z triton_flex_attention_781 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.8000101Z triton_flex_attention_779 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8001254Z triton_flex_attention_782 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.8002365Z triton_flex_attention_780 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8002640Z SingleProcess AUTOTUNE benchmarking takes 0.2940 seconds and 1.3371 seconds precompiling for 6 choices 2025-12-04T09:15:00.8002705Z Autotune Choices Stats: 2025-12-04T09:15:00.8004144Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_787", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.8004584Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.8004918Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.8005478Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.8006679Z triton_flex_attention_backward_787 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.8007858Z triton_flex_attention_backward_785 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8009038Z triton_flex_attention_backward_786 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8010195Z triton_flex_attention_backward_788 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.8011345Z triton_flex_attention_backward_790 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8012644Z triton_flex_attention_backward_791 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.8013791Z triton_flex_attention_backward_789 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8014941Z triton_flex_attention_backward_792 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.8016159Z triton_flex_attention_backward_794 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8017297Z triton_flex_attention_backward_797 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.8017582Z SingleProcess AUTOTUNE benchmarking takes 0.6666 seconds and 2.3976 seconds precompiling for 13 choices 2025-12-04T09:15:00.8017714Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.8017781Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.8017846Z unimplemented [] 2025-12-04T09:15:00.8017959Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.8018146Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.8019318Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.8019417Z graph_break [] 2025-12-04T09:15:00.8019544Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.8019609Z Autotune Choices Stats: 2025-12-04T09:15:00.8021002Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_803", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.8021243Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.8021466Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.8021778Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.8022903Z triton_flex_attention_803 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8024096Z triton_flex_attention_802 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8025215Z triton_flex_attention_801 0.0143 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.8026356Z triton_flex_attention_798 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8027471Z triton_flex_attention_800 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.8028614Z triton_flex_attention_799 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8028865Z SingleProcess AUTOTUNE benchmarking takes 0.2967 seconds and 1.3426 seconds precompiling for 6 choices 2025-12-04T09:15:00.8028930Z Autotune Choices Stats: 2025-12-04T09:15:00.8030368Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_806", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.8030800Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.8031127Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.8031698Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.8032934Z triton_flex_attention_backward_806 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.8034114Z triton_flex_attention_backward_807 0.0142 ms 93.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.8035265Z triton_flex_attention_backward_804 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8036415Z triton_flex_attention_backward_805 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8037592Z triton_flex_attention_backward_809 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8038743Z triton_flex_attention_backward_810 0.0163 ms 81.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.8039891Z triton_flex_attention_backward_813 0.0163 ms 81.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8041073Z triton_flex_attention_backward_808 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8042254Z triton_flex_attention_backward_811 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.8043424Z triton_flex_attention_backward_816 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.8043674Z SingleProcess AUTOTUNE benchmarking takes 0.6691 seconds and 2.4016 seconds precompiling for 13 choices 2025-12-04T09:15:00.8043801Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.8043868Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.8043935Z unimplemented [] 2025-12-04T09:15:00.8044034Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.8044224Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.8045429Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.8045489Z graph_break [] 2025-12-04T09:15:00.8045623Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.8045690Z Autotune Choices Stats: 2025-12-04T09:15:00.8047079Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_821", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.8047323Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.8047549Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.8047861Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.8049036Z triton_flex_attention_821 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8050176Z triton_flex_attention_822 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8051330Z triton_flex_attention_819 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.8052482Z triton_flex_attention_820 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.8053610Z triton_flex_attention_817 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8054762Z triton_flex_attention_818 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8055011Z SingleProcess AUTOTUNE benchmarking takes 0.2943 seconds and 1.3304 seconds precompiling for 6 choices 2025-12-04T09:15:00.8055080Z Autotune Choices Stats: 2025-12-04T09:15:00.8056512Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_826", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.8056939Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.8057355Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.8057917Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.8059088Z triton_flex_attention_backward_826 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.8060287Z triton_flex_attention_backward_823 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8061437Z triton_flex_attention_backward_824 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8062789Z triton_flex_attention_backward_825 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.8063942Z triton_flex_attention_backward_828 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8065094Z triton_flex_attention_backward_829 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.8066299Z triton_flex_attention_backward_830 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.8067502Z triton_flex_attention_backward_827 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8068713Z triton_flex_attention_backward_835 0.0173 ms 76.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.8069874Z triton_flex_attention_backward_831 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8070123Z SingleProcess AUTOTUNE benchmarking takes 0.6677 seconds and 2.3188 seconds precompiling for 13 choices 2025-12-04T09:15:00.8070301Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.8070369Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.8070435Z unimplemented [] 2025-12-04T09:15:00.8070539Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.8070722Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.8071899Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.8071957Z graph_break [] 2025-12-04T09:15:00.8072091Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.8072158Z Autotune Choices Stats: 2025-12-04T09:15:00.8073552Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_840", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.8073792Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.8074013Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.8074400Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.8075532Z triton_flex_attention_840 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8076680Z triton_flex_attention_841 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8077803Z triton_flex_attention_838 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.8078912Z triton_flex_attention_836 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8080076Z triton_flex_attention_839 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.8081186Z triton_flex_attention_837 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8081437Z SingleProcess AUTOTUNE benchmarking takes 0.2942 seconds and 1.3626 seconds precompiling for 6 choices 2025-12-04T09:15:00.8081502Z Autotune Choices Stats: 2025-12-04T09:15:00.8082936Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_844", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.8083433Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.8083766Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.8084323Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.8085746Z triton_flex_attention_backward_844 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.8086905Z triton_flex_attention_backward_845 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.8088062Z triton_flex_attention_backward_842 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8089264Z triton_flex_attention_backward_843 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8090411Z triton_flex_attention_backward_846 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8091563Z triton_flex_attention_backward_847 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8092832Z triton_flex_attention_backward_848 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.8093992Z triton_flex_attention_backward_851 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8095180Z triton_flex_attention_backward_849 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.8096329Z triton_flex_attention_backward_854 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.8096617Z SingleProcess AUTOTUNE benchmarking takes 0.6694 seconds and 2.3460 seconds precompiling for 13 choices 2025-12-04T09:15:00.8096744Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.8096812Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.8096879Z unimplemented [] 2025-12-04T09:15:00.8096981Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.8097163Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.8098339Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.8098401Z graph_break [] 2025-12-04T09:15:00.8098539Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.8098605Z Autotune Choices Stats: 2025-12-04T09:15:00.8100002Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_859", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.8100310Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.8100533Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.8100845Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.8101971Z triton_flex_attention_859 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8103127Z triton_flex_attention_860 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8104244Z triton_flex_attention_855 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8105394Z triton_flex_attention_857 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.8106517Z triton_flex_attention_858 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.8107630Z triton_flex_attention_856 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8107874Z SingleProcess AUTOTUNE benchmarking takes 0.2942 seconds and 1.3732 seconds precompiling for 6 choices 2025-12-04T09:15:00.8107939Z Autotune Choices Stats: 2025-12-04T09:15:00.8109412Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_862", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.8109873Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.8110200Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.8110800Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.8111971Z triton_flex_attention_backward_862 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8113127Z triton_flex_attention_backward_864 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.8114324Z triton_flex_attention_backward_861 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8115484Z triton_flex_attention_backward_863 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.8116629Z triton_flex_attention_backward_866 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8117815Z triton_flex_attention_backward_865 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8118996Z triton_flex_attention_backward_867 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.8120204Z triton_flex_attention_backward_868 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.8121355Z triton_flex_attention_backward_870 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8122539Z triton_flex_attention_backward_873 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.8122789Z SingleProcess AUTOTUNE benchmarking takes 0.6688 seconds and 2.4237 seconds precompiling for 13 choices 2025-12-04T09:15:00.8122928Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.8123002Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.8123070Z unimplemented [] 2025-12-04T09:15:00.8123172Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.8123357Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.8124531Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.8124590Z graph_break [] 2025-12-04T09:15:00.8124721Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.8124786Z Autotune Choices Stats: 2025-12-04T09:15:00.8126206Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_878", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.8126483Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.8126703Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.8127019Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.8128177Z triton_flex_attention_878 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8129290Z triton_flex_attention_879 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8130438Z triton_flex_attention_874 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8131556Z triton_flex_attention_876 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.8132728Z triton_flex_attention_877 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.8133844Z triton_flex_attention_875 0.0145 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8134176Z SingleProcess AUTOTUNE benchmarking takes 0.2942 seconds and 1.4549 seconds precompiling for 6 choices 2025-12-04T09:15:00.8134243Z Autotune Choices Stats: 2025-12-04T09:15:00.8135693Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_881", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.8136123Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.8136490Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.8137049Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.8138214Z triton_flex_attention_backward_881 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8139397Z triton_flex_attention_backward_880 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8140549Z triton_flex_attention_backward_882 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.8141709Z triton_flex_attention_backward_883 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.8142887Z triton_flex_attention_backward_885 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8144070Z triton_flex_attention_backward_884 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8145301Z triton_flex_attention_backward_886 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.8146457Z triton_flex_attention_backward_887 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.8147606Z triton_flex_attention_backward_889 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8148789Z triton_flex_attention_backward_892 0.0172 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.8149044Z SingleProcess AUTOTUNE benchmarking takes 0.6678 seconds and 2.4344 seconds precompiling for 13 choices 2025-12-04T09:15:00.8149173Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.8149240Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.8149308Z unimplemented [] 2025-12-04T09:15:00.8149406Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.8149588Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.8150760Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.8150854Z graph_break [] 2025-12-04T09:15:00.8151031Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.8151101Z Autotune Choices Stats: 2025-12-04T09:15:00.8152490Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_897", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.8152729Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.8152963Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.8153314Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.8154442Z triton_flex_attention_897 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8155562Z triton_flex_attention_898 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8156712Z triton_flex_attention_893 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8157828Z triton_flex_attention_895 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.8158953Z triton_flex_attention_896 0.0142 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.8160098Z triton_flex_attention_894 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8160377Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3455 seconds precompiling for 6 choices 2025-12-04T09:15:00.8160443Z Autotune Choices Stats: 2025-12-04T09:15:00.8162075Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_899", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.8162514Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.8162848Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.8163411Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.8164630Z triton_flex_attention_backward_899 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8165784Z triton_flex_attention_backward_900 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8166945Z triton_flex_attention_backward_901 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.8168140Z triton_flex_attention_backward_902 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.8169350Z triton_flex_attention_backward_904 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8170544Z triton_flex_attention_backward_903 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8171693Z triton_flex_attention_backward_905 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.8172900Z triton_flex_attention_backward_906 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.8174088Z triton_flex_attention_backward_907 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8175249Z triton_flex_attention_backward_908 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8175502Z SingleProcess AUTOTUNE benchmarking takes 0.6671 seconds and 2.2860 seconds precompiling for 13 choices 2025-12-04T09:15:00.8175632Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.8175700Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.8175768Z unimplemented [] 2025-12-04T09:15:00.8175870Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.8176051Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.8177265Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.8177358Z graph_break [] 2025-12-04T09:15:00.8177489Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.8177554Z Autotune Choices Stats: 2025-12-04T09:15:00.8178974Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_916", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.8179219Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.8179434Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.8179759Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.8180891Z triton_flex_attention_916 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8182042Z triton_flex_attention_917 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8183160Z triton_flex_attention_915 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.8184266Z triton_flex_attention_912 0.0143 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8185432Z triton_flex_attention_914 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.8186576Z triton_flex_attention_913 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8186825Z SingleProcess AUTOTUNE benchmarking takes 0.2932 seconds and 1.3622 seconds precompiling for 6 choices 2025-12-04T09:15:00.8186902Z Autotune Choices Stats: 2025-12-04T09:15:00.8188370Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_918", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:15:00.8188801Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.8189132Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.8189737Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.8190907Z triton_flex_attention_backward_918 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8192067Z triton_flex_attention_backward_919 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8193220Z triton_flex_attention_backward_920 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.8194431Z triton_flex_attention_backward_921 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.8195583Z triton_flex_attention_backward_923 0.0152 ms 94.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8196776Z triton_flex_attention_backward_922 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8197919Z triton_flex_attention_backward_924 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.8199105Z triton_flex_attention_backward_925 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.8200264Z triton_flex_attention_backward_926 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8201413Z triton_flex_attention_backward_927 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8201671Z SingleProcess AUTOTUNE benchmarking takes 0.6666 seconds and 2.2888 seconds precompiling for 13 choices 2025-12-04T09:15:00.8201844Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:15:00.8201956Z Traceback (most recent call last): 2025-12-04T09:15:00.8202295Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:15:00.8202364Z self.assertTrue( 2025-12-04T09:15:00.8202561Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:15:00.8202648Z raise self.failureException(msg) 2025-12-04T09:15:00.8202887Z AssertionError: False is not true : Log file /tmp/tmpa4rfsguv/flex_attention_configs.json was not created 2025-12-04T09:15:00.8202893Z 2025-12-04T09:15:00.8203031Z To execute this test, run the following from the base repo dir: 2025-12-04T09:15:00.8203281Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:15:00.8203286Z 2025-12-04T09:15:00.8203448Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:15:00.8203584Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.8203669Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.8203738Z unimplemented [] 2025-12-04T09:15:00.8203885Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.8205071Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('async_compile_cache_miss', 23), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:15:00.8205259Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.8205320Z graph_break [] 2025-12-04T09:15:00.8205452Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.8206471Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:15:00.8206552Z current_size = base.storage().size() 2025-12-04T09:15:00.8206623Z Autotune Choices Stats: 2025-12-04T09:15:00.8208014Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.8208272Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.8208490Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.8208814Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.8209981Z triton_flex_attention_4 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8211123Z triton_flex_attention_5 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8212229Z triton_flex_attention_0 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8213439Z triton_flex_attention_2 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.8214550Z triton_flex_attention_3 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.8215716Z triton_flex_attention_1 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8215964Z SingleProcess AUTOTUNE benchmarking takes 0.2600 seconds and 1.4244 seconds precompiling for 6 choices 2025-12-04T09:15:00.8216032Z Autotune Choices Stats: 2025-12-04T09:15:00.8217457Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.8217896Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.8218219Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.8218853Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.8220010Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8221202Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.8222357Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8223501Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.8224683Z triton_flex_attention_backward_11 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8225831Z triton_flex_attention_backward_12 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.8226994Z triton_flex_attention_backward_13 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.8228206Z triton_flex_attention_backward_10 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8229354Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8230534Z triton_flex_attention_backward_15 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8230780Z SingleProcess AUTOTUNE benchmarking takes 0.8717 seconds and 2.2315 seconds precompiling for 13 choices 2025-12-04T09:15:00.8230917Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.8230986Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.8231049Z unimplemented [] 2025-12-04T09:15:00.8231161Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.8231381Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.8232558Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.8232618Z graph_break [] 2025-12-04T09:15:00.8232744Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.8232816Z Autotune Choices Stats: 2025-12-04T09:15:00.8234202Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.8234455Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.8234671Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.8234993Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.8236191Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8237312Z triton_flex_attention_24 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8238455Z triton_flex_attention_19 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8239572Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.8240714Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.8241824Z triton_flex_attention_20 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8242069Z SingleProcess AUTOTUNE benchmarking takes 0.2898 seconds and 1.2778 seconds precompiling for 6 choices 2025-12-04T09:15:00.8242140Z Autotune Choices Stats: 2025-12-04T09:15:00.8243579Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.8244014Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.8244404Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.8244972Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.8246131Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8247333Z triton_flex_attention_backward_27 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.8248486Z triton_flex_attention_backward_28 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.8249661Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8250811Z triton_flex_attention_backward_30 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8251951Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8253209Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.8254392Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.8255573Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8256726Z triton_flex_attention_backward_37 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.8257014Z SingleProcess AUTOTUNE benchmarking takes 0.6605 seconds and 2.3199 seconds precompiling for 13 choices 2025-12-04T09:15:00.8257151Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.8257223Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.8257284Z unimplemented [] 2025-12-04T09:15:00.8257391Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.8257573Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.8258736Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.8258804Z graph_break [] 2025-12-04T09:15:00.8258930Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.8259003Z Autotune Choices Stats: 2025-12-04T09:15:00.8260388Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.8260633Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.8260917Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.8261242Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.8262511Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8263694Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8264807Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.8265924Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.8267073Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8268189Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8268433Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2616 seconds precompiling for 6 choices 2025-12-04T09:15:00.8268504Z Autotune Choices Stats: 2025-12-04T09:15:00.8269968Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.8270447Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.8270770Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.8271336Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.8272521Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8273683Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.8274864Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.8276011Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8277160Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8278349Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8279612Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.8280784Z triton_flex_attention_backward_50 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.8281932Z triton_flex_attention_backward_53 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8283090Z triton_flex_attention_backward_56 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.8283379Z SingleProcess AUTOTUNE benchmarking takes 0.6631 seconds and 2.2635 seconds precompiling for 13 choices 2025-12-04T09:15:00.8283524Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.8283593Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.8283654Z unimplemented [] 2025-12-04T09:15:00.8283764Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.8283951Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.8285128Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.8285204Z graph_break [] 2025-12-04T09:15:00.8285335Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.8285406Z Autotune Choices Stats: 2025-12-04T09:15:00.8286827Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.8287109Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.8287326Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.8287646Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.8288932Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8290057Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8291168Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.8292358Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.8293485Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8294607Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8294855Z SingleProcess AUTOTUNE benchmarking takes 0.2907 seconds and 1.2500 seconds precompiling for 6 choices 2025-12-04T09:15:00.8294929Z Autotune Choices Stats: 2025-12-04T09:15:00.8296425Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.8296866Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.8297197Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.8297800Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.8298968Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8300136Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8301319Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.8302483Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.8303636Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8304862Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8306017Z triton_flex_attention_backward_69 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.8307188Z triton_flex_attention_backward_70 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.8308335Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8309520Z triton_flex_attention_backward_75 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.8309764Z SingleProcess AUTOTUNE benchmarking takes 0.6601 seconds and 2.1469 seconds precompiling for 13 choices 2025-12-04T09:15:00.8309891Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.8309968Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.8310031Z unimplemented [] 2025-12-04T09:15:00.8310140Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.8310327Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.8311499Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.8311564Z graph_break [] 2025-12-04T09:15:00.8311690Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.8311763Z Autotune Choices Stats: 2025-12-04T09:15:00.8313230Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.8313478Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.8313695Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.8314009Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.8315183Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8316303Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8317446Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.8318554Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.8319673Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8320816Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8321087Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.2576 seconds precompiling for 6 choices 2025-12-04T09:15:00.8321158Z Autotune Choices Stats: 2025-12-04T09:15:00.8322580Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_84", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.8323018Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.8323372Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.8323941Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.8325107Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.8326291Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.8327437Z triton_flex_attention_backward_83 0.0122 ms 92.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8328589Z triton_flex_attention_backward_82 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8329769Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8330942Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8332122Z triton_flex_attention_backward_88 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.8333308Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.8334503Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8335649Z triton_flex_attention_backward_91 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8335898Z SingleProcess AUTOTUNE benchmarking takes 0.6599 seconds and 2.2799 seconds precompiling for 13 choices 2025-12-04T09:15:00.8336027Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.8336101Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.8336163Z unimplemented [] 2025-12-04T09:15:00.8336270Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.8338420Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.8339614Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.8341081Z graph_break [] 2025-12-04T09:15:00.8341248Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.8341319Z Autotune Choices Stats: 2025-12-04T09:15:00.8342729Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:15:00.8343002Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.8343281Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.8343611Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.8344747Z triton_flex_attention_99 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8345861Z triton_flex_attention_100 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8346969Z triton_flex_attention_95 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8348078Z triton_flex_attention_98 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.8349184Z triton_flex_attention_97 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.8350418Z triton_flex_attention_96 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8350710Z SingleProcess AUTOTUNE benchmarking takes 0.2905 seconds and 1.2994 seconds precompiling for 6 choices 2025-12-04T09:15:00.8350784Z Autotune Choices Stats: 2025-12-04T09:15:00.8352249Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_102", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.8352701Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.8353040Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.8353604Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.8354782Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8355934Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.8357074Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.8358300Z triton_flex_attention_backward_101 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8359496Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8360674Z triton_flex_attention_backward_107 0.0133 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.8361989Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8363138Z triton_flex_attention_backward_108 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.8364284Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8365421Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8365677Z SingleProcess AUTOTUNE benchmarking takes 0.6615 seconds and 2.2434 seconds precompiling for 13 choices 2025-12-04T09:15:00.8365901Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.8365981Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.8366047Z unimplemented [] 2025-12-04T09:15:00.8366153Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.8366350Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.8367581Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.8367704Z graph_break [] 2025-12-04T09:15:00.8367835Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.8367904Z Autotune Choices Stats: 2025-12-04T09:15:00.8369351Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.8369600Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.8369826Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.8370147Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.8371269Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8372422Z triton_flex_attention_119 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8373540Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.8374641Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.8375825Z triton_flex_attention_114 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8376981Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8377225Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2587 seconds precompiling for 6 choices 2025-12-04T09:15:00.8377292Z Autotune Choices Stats: 2025-12-04T09:15:00.8378757Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:15:00.8379202Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.8379526Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.8380089Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.8381250Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8382405Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8383577Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.8384753Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.8385926Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8387103Z triton_flex_attention_backward_124 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8388243Z triton_flex_attention_backward_126 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.8389394Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.8390539Z triton_flex_attention_backward_129 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8391675Z triton_flex_attention_backward_132 0.0145 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.8391967Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.3330 seconds precompiling for 13 choices 2025-12-04T09:15:00.8392094Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.8392170Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.8392263Z unimplemented [] 2025-12-04T09:15:00.8392403Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.8392594Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.8393763Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.8393829Z graph_break [] 2025-12-04T09:15:00.8393955Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.8394022Z Autotune Choices Stats: 2025-12-04T09:15:00.8395441Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:15:00.8395689Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.8395913Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.8396231Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.8397358Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8398462Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8399570Z triton_flex_attention_136 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.8400740Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.8401877Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8403024Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8403269Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3488 seconds precompiling for 6 choices 2025-12-04T09:15:00.8403334Z Autotune Choices Stats: 2025-12-04T09:15:00.8404765Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_140", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.8405205Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.8405536Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.8406098Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.8407260Z triton_flex_attention_backward_140 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8408412Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.8409634Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8410811Z triton_flex_attention_backward_142 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.8411978Z triton_flex_attention_backward_144 0.0131 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8413185Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8414331Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.8415489Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.8416663Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8417940Z triton_flex_attention_backward_149 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.8418227Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.2392 seconds precompiling for 13 choices 2025-12-04T09:15:00.8422560Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.8422669Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.8422748Z unimplemented [] 2025-12-04T09:15:00.8422870Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.8423069Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.8424334Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.8424406Z graph_break [] 2025-12-04T09:15:00.8424549Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.8424628Z Autotune Choices Stats: 2025-12-04T09:15:00.8426035Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.8426300Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.8426526Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.8426855Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.8428000Z triton_flex_attention_156 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8429113Z triton_flex_attention_157 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8430283Z triton_flex_attention_152 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8431432Z triton_flex_attention_154 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.8432589Z triton_flex_attention_155 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.8433732Z triton_flex_attention_153 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8433982Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3309 seconds precompiling for 6 choices 2025-12-04T09:15:00.8434058Z Autotune Choices Stats: 2025-12-04T09:15:00.8435499Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_160", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.8435940Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.8436272Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.8436842Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.8438013Z triton_flex_attention_backward_160 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.8439256Z triton_flex_attention_backward_158 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8440444Z triton_flex_attention_backward_159 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8441626Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.8442774Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8443920Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8445065Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.8446217Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.8447415Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8448604Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.8448888Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.1546 seconds precompiling for 13 choices 2025-12-04T09:15:00.8449033Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.8449107Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.8449171Z unimplemented [] 2025-12-04T09:15:00.8449283Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.8449474Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.8450694Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.8450758Z graph_break [] 2025-12-04T09:15:00.8450888Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.8450961Z Autotune Choices Stats: 2025-12-04T09:15:00.8452413Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.8452670Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.8452892Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.8453219Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.8454348Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8455459Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8456661Z triton_flex_attention_171 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8457815Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.8458959Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.8460086Z triton_flex_attention_172 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8460335Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3199 seconds precompiling for 6 choices 2025-12-04T09:15:00.8460411Z Autotune Choices Stats: 2025-12-04T09:15:00.8461978Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_178", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.8462431Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.8462760Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.8463329Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.8464633Z triton_flex_attention_backward_178 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8465843Z triton_flex_attention_backward_177 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8467044Z triton_flex_attention_backward_180 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.8468186Z triton_flex_attention_backward_179 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.8469340Z triton_flex_attention_backward_182 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8470484Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8471636Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.8472780Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.8473996Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8475179Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.8475429Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3449 seconds precompiling for 13 choices 2025-12-04T09:15:00.8475581Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.8475707Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.8475776Z unimplemented [] 2025-12-04T09:15:00.8475889Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.8476076Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.8477253Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.8477323Z graph_break [] 2025-12-04T09:15:00.8477453Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.8477530Z Autotune Choices Stats: 2025-12-04T09:15:00.8478922Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:15:00.8479173Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.8479393Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.8479730Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.8480854Z triton_flex_attention_194 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8482044Z triton_flex_attention_195 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8483185Z triton_flex_attention_192 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.8484333Z triton_flex_attention_193 0.0102 ms 90.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.8485439Z triton_flex_attention_190 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8486562Z triton_flex_attention_191 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8486807Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3094 seconds precompiling for 6 choices 2025-12-04T09:15:00.8486883Z Autotune Choices Stats: 2025-12-04T09:15:00.8488303Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.8488744Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.8489067Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.8489691Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.8490884Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8492083Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.8493320Z triton_flex_attention_backward_197 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8494478Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.8495635Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8496782Z triton_flex_attention_backward_200 0.0124 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8497935Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.8499148Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.8500328Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.8501511Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8501756Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.3930 seconds precompiling for 13 choices 2025-12-04T09:15:00.8501891Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.8501962Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.8502025Z unimplemented [] 2025-12-04T09:15:00.8502135Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.8502319Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.8503494Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.8503563Z graph_break [] 2025-12-04T09:15:00.8503690Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.8503765Z Autotune Choices Stats: 2025-12-04T09:15:00.8505145Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008063999935984612, "best_triton_pos": 0} 2025-12-04T09:15:00.8505392Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.8505609Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.8505963Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.8507124Z triton_flex_attention_213 0.0081 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8508278Z triton_flex_attention_214 0.0082 ms 98.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8509419Z triton_flex_attention_211 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.8510532Z triton_flex_attention_212 0.0102 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.8511643Z triton_flex_attention_209 0.0112 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8512755Z triton_flex_attention_210 0.0133 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8512997Z SingleProcess AUTOTUNE benchmarking takes 0.2904 seconds and 1.3148 seconds precompiling for 6 choices 2025-12-04T09:15:00.8513070Z Autotune Choices Stats: 2025-12-04T09:15:00.8514494Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.8514973Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.8515307Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.8515939Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.8517099Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8518281Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8519427Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.8520585Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.8521735Z triton_flex_attention_backward_219 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8522881Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8524063Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.8525245Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.8526432Z triton_flex_attention_backward_224 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8527617Z triton_flex_attention_backward_227 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.8527862Z SingleProcess AUTOTUNE benchmarking takes 0.6619 seconds and 2.3746 seconds precompiling for 13 choices 2025-12-04T09:15:00.8527990Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.8528067Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.8528131Z unimplemented [] 2025-12-04T09:15:00.8528243Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.8528437Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.8529610Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.8529680Z graph_break [] 2025-12-04T09:15:00.8529807Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.8529887Z Autotune Choices Stats: 2025-12-04T09:15:00.8531280Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.8531568Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.8531781Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.8532100Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.8533319Z triton_flex_attention_232 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8534652Z triton_flex_attention_233 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8535795Z triton_flex_attention_228 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8536929Z triton_flex_attention_230 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.8538043Z triton_flex_attention_231 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.8539164Z triton_flex_attention_229 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8539410Z SingleProcess AUTOTUNE benchmarking takes 0.2900 seconds and 1.2906 seconds precompiling for 6 choices 2025-12-04T09:15:00.8539482Z Autotune Choices Stats: 2025-12-04T09:15:00.8540905Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:15:00.8541429Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.8541786Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.8542353Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.8543547Z triton_flex_attention_backward_234 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8544707Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8545862Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.8547022Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.8548178Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8549322Z triton_flex_attention_backward_240 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.8550540Z triton_flex_attention_backward_241 0.0134 ms 91.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.8551718Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8552970Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8554124Z triton_flex_attention_backward_246 0.0153 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.8554372Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2768 seconds precompiling for 13 choices 2025-12-04T09:15:00.8554508Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.8554584Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.8554646Z unimplemented [] 2025-12-04T09:15:00.8554758Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.8554945Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.8556120Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.8556192Z graph_break [] 2025-12-04T09:15:00.8556320Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.8556395Z Autotune Choices Stats: 2025-12-04T09:15:00.8557783Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.8558068Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.8558354Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.8558672Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.8559798Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8560945Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8562238Z triton_flex_attention_250 0.0111 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.8563369Z triton_flex_attention_247 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8564482Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.8565607Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8565924Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2952 seconds precompiling for 6 choices 2025-12-04T09:15:00.8565999Z Autotune Choices Stats: 2025-12-04T09:15:00.8567482Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_255", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.8567966Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.8568291Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.8568862Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.8570080Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.8571238Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8572427Z triton_flex_attention_backward_254 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8573587Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.8574756Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8575991Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8577170Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.8578351Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.8579498Z triton_flex_attention_backward_262 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8580668Z triton_flex_attention_backward_265 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.8582176Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.3139 seconds precompiling for 13 choices 2025-12-04T09:15:00.8582650Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.8582946Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.8583153Z unimplemented [] 2025-12-04T09:15:00.8583360Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.8583731Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.8585186Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.8586532Z graph_break [] 2025-12-04T09:15:00.8586765Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.8587046Z Autotune Choices Stats: 2025-12-04T09:15:00.8588594Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.8590329Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.8590881Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.8591501Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.8593065Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8595390Z triton_flex_attention_271 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8597699Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8599993Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.8602283Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.8604575Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8606041Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T09:15:00.8606427Z Autotune Choices Stats: 2025-12-04T09:15:00.8608375Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:15:00.8610373Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.8611258Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.8612228Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.8614161Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8616546Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8618927Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.8621303Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.8623742Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8626148Z triton_flex_attention_backward_278 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.8628573Z triton_flex_attention_backward_279 0.0136 ms 90.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.8630954Z triton_flex_attention_backward_276 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8633332Z triton_flex_attention_backward_281 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8635701Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.8637169Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.2940 seconds precompiling for 13 choices 2025-12-04T09:15:00.8637625Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.8637924Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.8638127Z unimplemented [] 2025-12-04T09:15:00.8638328Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.8638703Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.8640157Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.8641507Z graph_break [] 2025-12-04T09:15:00.8641729Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.8642005Z Autotune Choices Stats: 2025-12-04T09:15:00.8643550Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.8645277Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.8645820Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.8646470Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.8647997Z triton_flex_attention_289 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8650309Z triton_flex_attention_290 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8652666Z triton_flex_attention_287 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.8654972Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8657277Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.8659641Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8661102Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.3003 seconds precompiling for 6 choices 2025-12-04T09:15:00.8661490Z Autotune Choices Stats: 2025-12-04T09:15:00.8663215Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_294", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.012160000391304493, "best_triton_pos": 0} 2025-12-04T09:15:00.8665224Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.8666064Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.8667033Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.8668873Z triton_flex_attention_backward_294 0.0122 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.8671260Z triton_flex_attention_backward_291 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8673644Z triton_flex_attention_backward_292 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8676092Z triton_flex_attention_backward_293 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.8678504Z triton_flex_attention_backward_296 0.0123 ms 99.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8680923Z triton_flex_attention_backward_295 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8683325Z triton_flex_attention_backward_297 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.8685703Z triton_flex_attention_backward_298 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.8688078Z triton_flex_attention_backward_300 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8690447Z triton_flex_attention_backward_303 0.0143 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.8691920Z SingleProcess AUTOTUNE benchmarking takes 0.6635 seconds and 2.2523 seconds precompiling for 13 choices 2025-12-04T09:15:00.8692431Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.8692726Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.8692964Z unimplemented [] 2025-12-04T09:15:00.8693172Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.8693531Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.8695015Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.8696355Z graph_break [] 2025-12-04T09:15:00.8696579Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.8696854Z Autotune Choices Stats: 2025-12-04T09:15:00.8698354Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_308", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.8700091Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.8700631Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.8701243Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.8702769Z triton_flex_attention_308 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8705076Z triton_flex_attention_309 0.0082 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8707377Z triton_flex_attention_307 0.0112 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.8709689Z triton_flex_attention_304 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8712052Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.8714386Z triton_flex_attention_305 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8715814Z SingleProcess AUTOTUNE benchmarking takes 0.2924 seconds and 1.4096 seconds precompiling for 6 choices 2025-12-04T09:15:00.8716195Z Autotune Choices Stats: 2025-12-04T09:15:00.8717761Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.8719690Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.8720520Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.8721484Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.8723305Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8725698Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8728080Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.8730535Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.8733005Z triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8735414Z triton_flex_attention_backward_316 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.8737781Z triton_flex_attention_backward_317 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.8740158Z triton_flex_attention_backward_314 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8742521Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8744892Z triton_flex_attention_backward_318 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8746393Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.3225 seconds precompiling for 13 choices 2025-12-04T09:15:00.8746850Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.8747133Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.8747323Z unimplemented [] 2025-12-04T09:15:00.8747536Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.8747938Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.8749431Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.8750742Z graph_break [] 2025-12-04T09:15:00.8750960Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.8751239Z Autotune Choices Stats: 2025-12-04T09:15:00.8752776Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.007968000136315823, "best_triton_pos": 0} 2025-12-04T09:15:00.8754491Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.8755033Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.8755643Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.8757170Z triton_flex_attention_327 0.0080 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8759467Z triton_flex_attention_328 0.0082 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8761763Z triton_flex_attention_325 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.8764321Z triton_flex_attention_326 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.8766700Z triton_flex_attention_323 0.0113 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8769061Z triton_flex_attention_324 0.0133 ms 59.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8770498Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T09:15:00.8770937Z Autotune Choices Stats: 2025-12-04T09:15:00.8772542Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.8774477Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.8775311Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.8776275Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.8778088Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8780479Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8782947Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.8785342Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.8787736Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8790110Z triton_flex_attention_backward_336 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.8792480Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8794842Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.8797204Z triton_flex_attention_backward_338 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8799568Z triton_flex_attention_backward_341 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.8801082Z SingleProcess AUTOTUNE benchmarking takes 0.6633 seconds and 2.3668 seconds precompiling for 13 choices 2025-12-04T09:15:00.8801651Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.8801935Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.8802130Z unimplemented [] 2025-12-04T09:15:00.8802327Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.8802694Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.8804142Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.8805436Z graph_break [] 2025-12-04T09:15:00.8805705Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.8805994Z Autotune Choices Stats: 2025-12-04T09:15:00.8807504Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.8809203Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.8809759Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.8810382Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.8811899Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8814248Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8816537Z triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8818912Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.8821236Z triton_flex_attention_345 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.8823559Z triton_flex_attention_343 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8824993Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2820 seconds precompiling for 6 choices 2025-12-04T09:15:00.8825378Z Autotune Choices Stats: 2025-12-04T09:15:00.8826918Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_349", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.8828832Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.8829663Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.8830624Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.8832432Z triton_flex_attention_backward_349 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8834873Z triton_flex_attention_backward_348 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8837270Z triton_flex_attention_backward_350 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.8839685Z triton_flex_attention_backward_351 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.8842079Z triton_flex_attention_backward_353 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8844443Z triton_flex_attention_backward_355 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.8846811Z triton_flex_attention_backward_352 0.0143 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8849187Z triton_flex_attention_backward_354 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.8851548Z triton_flex_attention_backward_357 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8854066Z triton_flex_attention_backward_360 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.8855585Z SingleProcess AUTOTUNE benchmarking takes 0.6637 seconds and 2.2959 seconds precompiling for 13 choices 2025-12-04T09:15:00.8856044Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.8856319Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.8856512Z unimplemented [] 2025-12-04T09:15:00.8856720Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.8857092Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.8858571Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.8859878Z graph_break [] 2025-12-04T09:15:00.8860100Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.8860384Z Autotune Choices Stats: 2025-12-04T09:15:00.8862019Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.8863737Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.8864281Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.8864896Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.8866422Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8868735Z triton_flex_attention_366 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8871167Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.8873524Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8875878Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.8878183Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8879619Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.3270 seconds precompiling for 6 choices 2025-12-04T09:15:00.8880013Z Autotune Choices Stats: 2025-12-04T09:15:00.8881555Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_368", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.8883484Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.8884330Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.8885315Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.8887129Z triton_flex_attention_backward_368 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8889586Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8891997Z triton_flex_attention_backward_369 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.8894531Z triton_flex_attention_backward_370 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.8896918Z triton_flex_attention_backward_372 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8899291Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.8901666Z triton_flex_attention_backward_374 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.8904036Z triton_flex_attention_backward_371 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8906489Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8908885Z triton_flex_attention_backward_379 0.0146 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.8910359Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.4637 seconds precompiling for 13 choices 2025-12-04T09:15:00.8910821Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.8911106Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.8911295Z unimplemented [] 2025-12-04T09:15:00.8911500Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.8911905Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.8913345Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.8914644Z graph_break [] 2025-12-04T09:15:00.8914860Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.8915138Z Autotune Choices Stats: 2025-12-04T09:15:00.8916643Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_384", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.8918354Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.8918901Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.8919513Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.8921040Z triton_flex_attention_384 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8923421Z triton_flex_attention_385 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8925755Z triton_flex_attention_380 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8928053Z triton_flex_attention_382 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.8930396Z triton_flex_attention_383 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.8932784Z triton_flex_attention_381 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8934231Z SingleProcess AUTOTUNE benchmarking takes 0.2918 seconds and 1.3242 seconds precompiling for 6 choices 2025-12-04T09:15:00.8934628Z Autotune Choices Stats: 2025-12-04T09:15:00.8936178Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_386", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:15:00.8938105Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.8938936Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.8939941Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.8941794Z triton_flex_attention_backward_386 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8944214Z triton_flex_attention_backward_387 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8946634Z triton_flex_attention_backward_388 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.8949045Z triton_flex_attention_backward_389 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.8951427Z triton_flex_attention_backward_391 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8953803Z triton_flex_attention_backward_392 0.0134 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.8956179Z triton_flex_attention_backward_390 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8958603Z triton_flex_attention_backward_393 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.8961019Z triton_flex_attention_backward_395 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8963654Z triton_flex_attention_backward_394 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.8965194Z SingleProcess AUTOTUNE benchmarking takes 0.6645 seconds and 2.2524 seconds precompiling for 13 choices 2025-12-04T09:15:00.8965652Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.8965944Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.8966140Z unimplemented [] 2025-12-04T09:15:00.8966345Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.8966706Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.8968154Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.8969467Z graph_break [] 2025-12-04T09:15:00.8969697Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.8969972Z Autotune Choices Stats: 2025-12-04T09:15:00.8971480Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_403", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.8973226Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.8973772Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.8974383Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.8975971Z triton_flex_attention_403 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8978333Z triton_flex_attention_404 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8980687Z triton_flex_attention_399 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8983007Z triton_flex_attention_401 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.8985302Z triton_flex_attention_402 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.8987599Z triton_flex_attention_400 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8989034Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3374 seconds precompiling for 6 choices 2025-12-04T09:15:00.8989415Z Autotune Choices Stats: 2025-12-04T09:15:00.8990958Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_406", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.8992882Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.8993753Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.8994754Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.8996603Z triton_flex_attention_backward_406 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.8999030Z triton_flex_attention_backward_405 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9001409Z triton_flex_attention_backward_407 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.9003783Z triton_flex_attention_backward_408 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.9006175Z triton_flex_attention_backward_410 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9008558Z triton_flex_attention_backward_409 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9010946Z triton_flex_attention_backward_411 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.9013437Z triton_flex_attention_backward_412 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.9015849Z triton_flex_attention_backward_413 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9018260Z triton_flex_attention_backward_414 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9019747Z SingleProcess AUTOTUNE benchmarking takes 0.6644 seconds and 2.4031 seconds precompiling for 13 choices 2025-12-04T09:15:00.9020207Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.9020495Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.9020688Z unimplemented [] 2025-12-04T09:15:00.9020897Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.9021263Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.9022706Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.9024011Z graph_break [] 2025-12-04T09:15:00.9024243Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.9024527Z Autotune Choices Stats: 2025-12-04T09:15:00.9026037Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_422", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.9027730Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.9028302Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.9028918Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.9030496Z triton_flex_attention_422 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9032848Z triton_flex_attention_423 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9035184Z triton_flex_attention_420 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.9037484Z triton_flex_attention_421 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.9039791Z triton_flex_attention_418 0.0123 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9042091Z triton_flex_attention_419 0.0142 ms 57.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9043526Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3073 seconds precompiling for 6 choices 2025-12-04T09:15:00.9043914Z Autotune Choices Stats: 2025-12-04T09:15:00.9045478Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_426", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012256000190973282, "best_triton_pos": 0} 2025-12-04T09:15:00.9047443Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.9048319Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.9049319Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.9051138Z triton_flex_attention_backward_426 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.9053623Z triton_flex_attention_backward_424 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9056009Z triton_flex_attention_backward_425 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9058385Z triton_flex_attention_backward_427 0.0123 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.9060771Z triton_flex_attention_backward_429 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9063329Z triton_flex_attention_backward_431 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.9065837Z triton_flex_attention_backward_430 0.0141 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.9068266Z triton_flex_attention_backward_428 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9070693Z triton_flex_attention_backward_433 0.0143 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9073233Z triton_flex_attention_backward_436 0.0144 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.9074703Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.3392 seconds precompiling for 13 choices 2025-12-04T09:15:00.9075156Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.9075438Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.9075636Z unimplemented [] 2025-12-04T09:15:00.9075837Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.9076215Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.9077649Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.9078955Z graph_break [] 2025-12-04T09:15:00.9079178Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.9079466Z Autotune Choices Stats: 2025-12-04T09:15:00.9080983Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_441", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.9082720Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.9083262Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.9083953Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.9085523Z triton_flex_attention_441 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9087861Z triton_flex_attention_442 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9090166Z triton_flex_attention_439 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.9092510Z triton_flex_attention_437 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9094824Z triton_flex_attention_440 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.9097124Z triton_flex_attention_438 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9098555Z SingleProcess AUTOTUNE benchmarking takes 0.2906 seconds and 1.4377 seconds precompiling for 6 choices 2025-12-04T09:15:00.9098940Z Autotune Choices Stats: 2025-12-04T09:15:00.9100580Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_445", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.9102529Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.9103360Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.9104323Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.9106163Z triton_flex_attention_backward_445 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.9108564Z triton_flex_attention_backward_446 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.9110958Z triton_flex_attention_backward_443 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9113348Z triton_flex_attention_backward_444 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9115733Z triton_flex_attention_backward_447 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9118184Z triton_flex_attention_backward_448 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9120589Z triton_flex_attention_backward_450 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.9123001Z triton_flex_attention_backward_449 0.0135 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.9125366Z triton_flex_attention_backward_452 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9127739Z triton_flex_attention_backward_455 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.9129225Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.3278 seconds precompiling for 13 choices 2025-12-04T09:15:00.9129675Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.9129962Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.9130159Z unimplemented [] 2025-12-04T09:15:00.9130365Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.9130727Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.9132170Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.9133579Z graph_break [] 2025-12-04T09:15:00.9133805Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.9134078Z Autotune Choices Stats: 2025-12-04T09:15:00.9135665Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_460", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T09:15:00.9137402Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.9137942Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.9138557Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.9140091Z triton_flex_attention_460 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9142435Z triton_flex_attention_461 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9144744Z triton_flex_attention_459 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.9147051Z triton_flex_attention_456 0.0114 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9149358Z triton_flex_attention_458 0.0123 ms 75.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.9151677Z triton_flex_attention_457 0.0133 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9153160Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.4513 seconds precompiling for 6 choices 2025-12-04T09:15:00.9153542Z Autotune Choices Stats: 2025-12-04T09:15:00.9155120Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_463", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.9157085Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.9157924Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.9158947Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.9160769Z triton_flex_attention_backward_463 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9163337Z triton_flex_attention_backward_465 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.9165732Z triton_flex_attention_backward_462 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9168112Z triton_flex_attention_backward_464 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.9170496Z triton_flex_attention_backward_467 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9173056Z triton_flex_attention_backward_468 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.9175489Z triton_flex_attention_backward_469 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.9177905Z triton_flex_attention_backward_466 0.0133 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9180276Z triton_flex_attention_backward_471 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9182651Z triton_flex_attention_backward_470 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9184128Z SingleProcess AUTOTUNE benchmarking takes 0.6629 seconds and 2.2662 seconds precompiling for 13 choices 2025-12-04T09:15:00.9184590Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.9184870Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.9185072Z unimplemented [] 2025-12-04T09:15:00.9185283Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.9185649Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.9187100Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.9188447Z graph_break [] 2025-12-04T09:15:00.9188676Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.9188957Z Autotune Choices Stats: 2025-12-04T09:15:00.9190505Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_479", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.007968000136315823, "best_triton_pos": 0} 2025-12-04T09:15:00.9192230Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.9192764Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.9193395Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.9194967Z triton_flex_attention_479 0.0080 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9197280Z triton_flex_attention_480 0.0082 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9199584Z triton_flex_attention_477 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.9201899Z triton_flex_attention_478 0.0102 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.9204203Z triton_flex_attention_475 0.0113 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9206534Z triton_flex_attention_476 0.0133 ms 59.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9208003Z SingleProcess AUTOTUNE benchmarking takes 0.2917 seconds and 1.3266 seconds precompiling for 6 choices 2025-12-04T09:15:00.9208424Z Autotune Choices Stats: 2025-12-04T09:15:00.9209973Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_481", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.9211907Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.9212828Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.9213798Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.9215620Z triton_flex_attention_backward_481 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9218015Z triton_flex_attention_backward_482 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9220416Z triton_flex_attention_backward_483 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.9222812Z triton_flex_attention_backward_484 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.9225273Z triton_flex_attention_backward_485 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9227704Z triton_flex_attention_backward_486 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9230139Z triton_flex_attention_backward_487 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.9232520Z triton_flex_attention_backward_488 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.9234900Z triton_flex_attention_backward_490 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9237267Z triton_flex_attention_backward_493 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.9238736Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.2475 seconds precompiling for 13 choices 2025-12-04T09:15:00.9239198Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.9239493Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.9239703Z unimplemented [] 2025-12-04T09:15:00.9239901Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.9240262Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.9241779Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.9243075Z graph_break [] 2025-12-04T09:15:00.9243379Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.9243657Z Autotune Choices Stats: 2025-12-04T09:15:00.9245166Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_498", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.9246979Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.9247557Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.9248175Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.9249707Z triton_flex_attention_498 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9252011Z triton_flex_attention_499 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9254415Z triton_flex_attention_494 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9256712Z triton_flex_attention_496 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.9259014Z triton_flex_attention_497 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.9261395Z triton_flex_attention_495 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9263034Z SingleProcess AUTOTUNE benchmarking takes 0.2916 seconds and 1.3152 seconds precompiling for 6 choices 2025-12-04T09:15:00.9263433Z Autotune Choices Stats: 2025-12-04T09:15:00.9265048Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_500", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:15:00.9266979Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.9267817Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.9268787Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.9270599Z triton_flex_attention_backward_500 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9272987Z triton_flex_attention_backward_501 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9275375Z triton_flex_attention_backward_502 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.9277853Z triton_flex_attention_backward_503 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.9280283Z triton_flex_attention_backward_505 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9282705Z triton_flex_attention_backward_507 0.0133 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.9285091Z triton_flex_attention_backward_506 0.0143 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.9287485Z triton_flex_attention_backward_504 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9289875Z triton_flex_attention_backward_509 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9292261Z triton_flex_attention_backward_512 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.9293798Z SingleProcess AUTOTUNE benchmarking takes 0.6641 seconds and 2.2733 seconds precompiling for 13 choices 2025-12-04T09:15:00.9294302Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.9294581Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.9294784Z unimplemented [] 2025-12-04T09:15:00.9294995Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.9295355Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.9296857Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 27), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 8), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 5), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.9298194Z graph_break [] 2025-12-04T09:15:00.9298422Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.9298703Z Autotune Choices Stats: 2025-12-04T09:15:00.9300252Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_517", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T09:15:00.9301956Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.9302496Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.9303120Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.9304652Z triton_flex_attention_517 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9306953Z triton_flex_attention_518 0.0084 ms 97.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9309246Z triton_flex_attention_513 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9311538Z triton_flex_attention_515 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.9313904Z triton_flex_attention_516 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.9316241Z triton_flex_attention_514 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9317681Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3311 seconds precompiling for 6 choices 2025-12-04T09:15:00.9318062Z Autotune Choices Stats: 2025-12-04T09:15:00.9319650Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_519", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T09:15:00.9321570Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.9322402Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.9323375Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.9325184Z triton_flex_attention_backward_519 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9327577Z triton_flex_attention_backward_520 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9330016Z triton_flex_attention_backward_521 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.9332532Z triton_flex_attention_backward_522 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.9334975Z triton_flex_attention_backward_524 0.0132 ms 93.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9337374Z triton_flex_attention_backward_523 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9339753Z triton_flex_attention_backward_525 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.9340913Z triton_flex_attention_backward_526 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.9342062Z triton_flex_attention_backward_528 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9343218Z triton_flex_attention_backward_531 0.0153 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.9343499Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.2909 seconds precompiling for 13 choices 2025-12-04T09:15:00.9343642Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.9343715Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.9343815Z unimplemented [] 2025-12-04T09:15:00.9343955Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.9344141Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.9345324Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.9345389Z graph_break [] 2025-12-04T09:15:00.9345524Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.9345593Z Autotune Choices Stats: 2025-12-04T09:15:00.9347025Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_536", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.9347283Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.9347503Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.9347832Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.9348959Z triton_flex_attention_536 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9350084Z triton_flex_attention_537 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9351205Z triton_flex_attention_532 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9352458Z triton_flex_attention_534 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.9353610Z triton_flex_attention_535 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.9354761Z triton_flex_attention_533 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9355014Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.2928 seconds precompiling for 6 choices 2025-12-04T09:15:00.9355081Z Autotune Choices Stats: 2025-12-04T09:15:00.9356517Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_538", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:15:00.9356960Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.9357297Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.9357864Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.9359044Z triton_flex_attention_backward_538 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9360204Z triton_flex_attention_backward_539 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9361432Z triton_flex_attention_backward_540 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.9362771Z triton_flex_attention_backward_541 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.9364008Z triton_flex_attention_backward_542 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9365187Z triton_flex_attention_backward_543 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9366337Z triton_flex_attention_backward_545 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.9367491Z triton_flex_attention_backward_544 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.9368641Z triton_flex_attention_backward_547 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9369886Z triton_flex_attention_backward_550 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.9370470Z SingleProcess AUTOTUNE benchmarking takes 0.6681 seconds and 2.3456 seconds precompiling for 13 choices 2025-12-04T09:15:00.9370609Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.9370678Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.9370747Z unimplemented [] 2025-12-04T09:15:00.9370854Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.9371041Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.9372260Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.9372376Z graph_break [] 2025-12-04T09:15:00.9372515Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.9372584Z Autotune Choices Stats: 2025-12-04T09:15:00.9373975Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_555", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.9374229Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.9374451Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.9374771Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.9375897Z triton_flex_attention_555 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9377015Z triton_flex_attention_556 0.0112 ms 91.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9378177Z triton_flex_attention_553 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.9379329Z triton_flex_attention_554 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.9380476Z triton_flex_attention_551 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9381641Z triton_flex_attention_552 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9381893Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.3230 seconds precompiling for 6 choices 2025-12-04T09:15:00.9381961Z Autotune Choices Stats: 2025-12-04T09:15:00.9383398Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_559", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.9383832Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.9384167Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.9384737Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.9385916Z triton_flex_attention_backward_559 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.9387145Z triton_flex_attention_backward_560 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.9388343Z triton_flex_attention_backward_558 0.0142 ms 93.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9389530Z triton_flex_attention_backward_557 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9390683Z triton_flex_attention_backward_562 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9391831Z triton_flex_attention_backward_561 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9392991Z triton_flex_attention_backward_563 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.9394148Z triton_flex_attention_backward_564 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.9395297Z triton_flex_attention_backward_569 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.9396517Z triton_flex_attention_backward_566 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9396793Z SingleProcess AUTOTUNE benchmarking takes 0.6659 seconds and 2.2862 seconds precompiling for 13 choices 2025-12-04T09:15:00.9396929Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.9397001Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.9397075Z unimplemented [] 2025-12-04T09:15:00.9397189Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.9397377Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.9398591Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.9398660Z graph_break [] 2025-12-04T09:15:00.9398797Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.9398864Z Autotune Choices Stats: 2025-12-04T09:15:00.9400266Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_575", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.9400518Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.9400742Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.9401067Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.9402197Z triton_flex_attention_575 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9403320Z triton_flex_attention_574 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9404509Z triton_flex_attention_572 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.9405658Z triton_flex_attention_573 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.9406807Z triton_flex_attention_570 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9407921Z triton_flex_attention_571 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9414617Z SingleProcess AUTOTUNE benchmarking takes 0.2930 seconds and 1.2848 seconds precompiling for 6 choices 2025-12-04T09:15:00.9414748Z Autotune Choices Stats: 2025-12-04T09:15:00.9416235Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_579", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.014271999709308147, "best_triton_pos": 0} 2025-12-04T09:15:00.9416694Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.9417023Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.9417597Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.9418892Z triton_flex_attention_backward_579 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.9420174Z triton_flex_attention_backward_577 0.0143 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9421385Z triton_flex_attention_backward_578 0.0143 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.9422545Z triton_flex_attention_backward_576 0.0145 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9423714Z triton_flex_attention_backward_580 0.0154 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9424873Z triton_flex_attention_backward_581 0.0154 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9426035Z triton_flex_attention_backward_583 0.0155 ms 92.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.9427198Z triton_flex_attention_backward_582 0.0164 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.9428409Z triton_flex_attention_backward_585 0.0164 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9429594Z triton_flex_attention_backward_588 0.0164 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.9429851Z SingleProcess AUTOTUNE benchmarking takes 0.6683 seconds and 2.2637 seconds precompiling for 13 choices 2025-12-04T09:15:00.9429999Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.9430109Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.9430176Z unimplemented [] 2025-12-04T09:15:00.9430294Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.9430486Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.9431668Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.9431737Z graph_break [] 2025-12-04T09:15:00.9431871Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.9431959Z Autotune Choices Stats: 2025-12-04T09:15:00.9433371Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.9433628Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.9433851Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.9434181Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.9435314Z triton_flex_attention_593 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9436518Z triton_flex_attention_594 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9437669Z triton_flex_attention_589 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9438825Z triton_flex_attention_591 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.9439951Z triton_flex_attention_592 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.9441076Z triton_flex_attention_590 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9441330Z SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.3283 seconds precompiling for 6 choices 2025-12-04T09:15:00.9441399Z Autotune Choices Stats: 2025-12-04T09:15:00.9442832Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_595", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:15:00.9443273Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.9443603Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.9444210Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.9445406Z triton_flex_attention_backward_595 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9446593Z triton_flex_attention_backward_596 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9447777Z triton_flex_attention_backward_597 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.9448932Z triton_flex_attention_backward_598 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.9450099Z triton_flex_attention_backward_600 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9451251Z triton_flex_attention_backward_604 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9452508Z triton_flex_attention_backward_599 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9453763Z triton_flex_attention_backward_601 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.9454944Z triton_flex_attention_backward_602 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.9456132Z triton_flex_attention_backward_607 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.9456382Z SingleProcess AUTOTUNE benchmarking takes 0.6683 seconds and 2.3815 seconds precompiling for 13 choices 2025-12-04T09:15:00.9456521Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.9456599Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.9456663Z unimplemented [] 2025-12-04T09:15:00.9456777Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.9456962Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.9458147Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.9458211Z graph_break [] 2025-12-04T09:15:00.9458342Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.9458418Z Autotune Choices Stats: 2025-12-04T09:15:00.9459807Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_612", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.9460061Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.9460283Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.9460656Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.9462053Z triton_flex_attention_612 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9463284Z triton_flex_attention_613 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9464467Z triton_flex_attention_610 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.9465593Z triton_flex_attention_608 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9466713Z triton_flex_attention_611 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.9467830Z triton_flex_attention_609 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9468081Z SingleProcess AUTOTUNE benchmarking takes 0.2948 seconds and 1.3336 seconds precompiling for 6 choices 2025-12-04T09:15:00.9468158Z Autotune Choices Stats: 2025-12-04T09:15:00.9469587Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_614", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:15:00.9470082Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.9470410Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.9471013Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.9472224Z triton_flex_attention_backward_614 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9473420Z triton_flex_attention_backward_615 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9474586Z triton_flex_attention_backward_616 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.9475743Z triton_flex_attention_backward_617 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.9476901Z triton_flex_attention_backward_619 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9478055Z triton_flex_attention_backward_620 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.9479243Z triton_flex_attention_backward_621 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.9480424Z triton_flex_attention_backward_618 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9481617Z triton_flex_attention_backward_623 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9482811Z triton_flex_attention_backward_626 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.9483060Z SingleProcess AUTOTUNE benchmarking takes 0.6671 seconds and 2.3094 seconds precompiling for 13 choices 2025-12-04T09:15:00.9483201Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.9483273Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.9483338Z unimplemented [] 2025-12-04T09:15:00.9483462Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.9483655Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.9484843Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.9484908Z graph_break [] 2025-12-04T09:15:00.9485040Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.9485119Z Autotune Choices Stats: 2025-12-04T09:15:00.9486512Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_631", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.9486820Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.9487040Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.9487365Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.9488528Z triton_flex_attention_631 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9489688Z triton_flex_attention_632 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9490884Z triton_flex_attention_627 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9492017Z triton_flex_attention_629 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.9493195Z triton_flex_attention_630 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.9494323Z triton_flex_attention_628 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9494572Z SingleProcess AUTOTUNE benchmarking takes 0.2939 seconds and 1.3035 seconds precompiling for 6 choices 2025-12-04T09:15:00.9494643Z Autotune Choices Stats: 2025-12-04T09:15:00.9496073Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_634", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.9496587Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.9496944Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.9497510Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.9498710Z triton_flex_attention_backward_634 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9499876Z triton_flex_attention_backward_635 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.9501044Z triton_flex_attention_backward_636 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.9502198Z triton_flex_attention_backward_633 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9503358Z triton_flex_attention_backward_638 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9504513Z triton_flex_attention_backward_640 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.9505733Z triton_flex_attention_backward_637 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9506911Z triton_flex_attention_backward_639 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.9508095Z triton_flex_attention_backward_642 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9509256Z triton_flex_attention_backward_645 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.9509502Z SingleProcess AUTOTUNE benchmarking takes 0.6697 seconds and 2.2461 seconds precompiling for 13 choices 2025-12-04T09:15:00.9509657Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.9509729Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.9509793Z unimplemented [] 2025-12-04T09:15:00.9509903Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.9510091Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.9511258Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.9511330Z graph_break [] 2025-12-04T09:15:00.9511460Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.9511534Z Autotune Choices Stats: 2025-12-04T09:15:00.9512926Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_650", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.9513213Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.9513463Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.9513818Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.9514946Z triton_flex_attention_650 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9516118Z triton_flex_attention_651 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9517240Z triton_flex_attention_648 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.9518364Z triton_flex_attention_646 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9519483Z triton_flex_attention_649 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.9520606Z triton_flex_attention_647 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9520884Z SingleProcess AUTOTUNE benchmarking takes 0.2938 seconds and 1.3188 seconds precompiling for 6 choices 2025-12-04T09:15:00.9520958Z Autotune Choices Stats: 2025-12-04T09:15:00.9522410Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_654", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.9522881Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.9523211Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.9523779Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.9524977Z triton_flex_attention_backward_654 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.9526141Z triton_flex_attention_backward_652 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9527294Z triton_flex_attention_backward_653 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9528450Z triton_flex_attention_backward_655 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.9529610Z triton_flex_attention_backward_657 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9530826Z triton_flex_attention_backward_659 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.9532008Z triton_flex_attention_backward_658 0.0156 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.9533251Z triton_flex_attention_backward_656 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9534410Z triton_flex_attention_backward_661 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9535569Z triton_flex_attention_backward_664 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.9535821Z SingleProcess AUTOTUNE benchmarking takes 0.6665 seconds and 2.3325 seconds precompiling for 13 choices 2025-12-04T09:15:00.9535963Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.9536035Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.9536099Z unimplemented [] 2025-12-04T09:15:00.9536208Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.9536397Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.9537572Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.9537681Z graph_break [] 2025-12-04T09:15:00.9537810Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.9537883Z Autotune Choices Stats: 2025-12-04T09:15:00.9539304Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_669", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.9539586Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.9539808Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.9540135Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.9541297Z triton_flex_attention_669 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9542429Z triton_flex_attention_670 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9543546Z triton_flex_attention_667 0.0123 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.9544661Z triton_flex_attention_665 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9545780Z triton_flex_attention_668 0.0135 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.9546922Z triton_flex_attention_666 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9547206Z SingleProcess AUTOTUNE benchmarking takes 0.2949 seconds and 1.4095 seconds precompiling for 6 choices 2025-12-04T09:15:00.9547282Z Autotune Choices Stats: 2025-12-04T09:15:00.9548744Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_673", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012512000277638435, "best_triton_pos": 0} 2025-12-04T09:15:00.9549233Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.9549591Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.9550159Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.9551326Z triton_flex_attention_backward_673 0.0125 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.9552490Z triton_flex_attention_backward_672 0.0133 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9553637Z triton_flex_attention_backward_674 0.0133 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.9554796Z triton_flex_attention_backward_671 0.0143 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9556017Z triton_flex_attention_backward_676 0.0143 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9557195Z triton_flex_attention_backward_675 0.0152 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9558377Z triton_flex_attention_backward_677 0.0154 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.9559521Z triton_flex_attention_backward_678 0.0154 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.9560679Z triton_flex_attention_backward_680 0.0164 ms 76.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9562027Z triton_flex_attention_backward_683 0.0164 ms 76.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.9562299Z SingleProcess AUTOTUNE benchmarking takes 0.6665 seconds and 2.3098 seconds precompiling for 13 choices 2025-12-04T09:15:00.9562440Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.9562515Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.9562591Z unimplemented [] 2025-12-04T09:15:00.9562703Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.9562896Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.9564080Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.9564218Z graph_break [] 2025-12-04T09:15:00.9564349Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.9564431Z Autotune Choices Stats: 2025-12-04T09:15:00.9565884Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_688", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.9566182Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.9566403Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.9566765Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.9567912Z triton_flex_attention_688 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9569038Z triton_flex_attention_689 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9570160Z triton_flex_attention_686 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.9571286Z triton_flex_attention_684 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9572458Z triton_flex_attention_687 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.9573657Z triton_flex_attention_685 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9573937Z SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.3382 seconds precompiling for 6 choices 2025-12-04T09:15:00.9574010Z Autotune Choices Stats: 2025-12-04T09:15:00.9575444Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_693", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.9575923Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.9576255Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.9576826Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.9578000Z triton_flex_attention_backward_693 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.9579166Z triton_flex_attention_backward_692 0.0135 ms 98.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.9580323Z triton_flex_attention_backward_691 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9581496Z triton_flex_attention_backward_695 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9582744Z triton_flex_attention_backward_690 0.0153 ms 86.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9583933Z triton_flex_attention_backward_696 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.9585116Z triton_flex_attention_backward_697 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.9586279Z triton_flex_attention_backward_699 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9587443Z triton_flex_attention_backward_694 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9588606Z triton_flex_attention_backward_702 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.9588853Z SingleProcess AUTOTUNE benchmarking takes 0.6670 seconds and 2.3646 seconds precompiling for 13 choices 2025-12-04T09:15:00.9588983Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.9589062Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.9589158Z unimplemented [] 2025-12-04T09:15:00.9589267Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.9589454Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.9590660Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.9590760Z graph_break [] 2025-12-04T09:15:00.9590889Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.9590962Z Autotune Choices Stats: 2025-12-04T09:15:00.9592351Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_707", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.9592635Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.9592855Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.9593172Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.9594311Z triton_flex_attention_707 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9595434Z triton_flex_attention_708 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9596549Z triton_flex_attention_703 0.0134 ms 76.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9597669Z triton_flex_attention_705 0.0135 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.9598849Z triton_flex_attention_706 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.9600003Z triton_flex_attention_704 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9600254Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3640 seconds precompiling for 6 choices 2025-12-04T09:15:00.9600329Z Autotune Choices Stats: 2025-12-04T09:15:00.9601791Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_712", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.9602235Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.9602562Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.9603136Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.9604306Z triton_flex_attention_backward_712 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.9605468Z triton_flex_attention_backward_710 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9606624Z triton_flex_attention_backward_711 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.9607869Z triton_flex_attention_backward_709 0.0153 ms 86.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9609058Z triton_flex_attention_backward_714 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9610247Z triton_flex_attention_backward_716 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.9611401Z triton_flex_attention_backward_713 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9612606Z triton_flex_attention_backward_715 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.9613770Z triton_flex_attention_backward_718 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9614925Z triton_flex_attention_backward_721 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.9615241Z SingleProcess AUTOTUNE benchmarking takes 0.6684 seconds and 2.3297 seconds precompiling for 13 choices 2025-12-04T09:15:00.9615377Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.9615458Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.9615525Z unimplemented [] 2025-12-04T09:15:00.9615630Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.9615829Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.9617072Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.9617143Z graph_break [] 2025-12-04T09:15:00.9617273Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.9617358Z Autotune Choices Stats: 2025-12-04T09:15:00.9618782Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_726", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.9619037Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.9619259Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.9619576Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.9620713Z triton_flex_attention_726 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9621830Z triton_flex_attention_727 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9622955Z triton_flex_attention_724 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.9624106Z triton_flex_attention_722 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9625252Z triton_flex_attention_725 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.9626474Z triton_flex_attention_723 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9626723Z SingleProcess AUTOTUNE benchmarking takes 0.2946 seconds and 1.3261 seconds precompiling for 6 choices 2025-12-04T09:15:00.9626839Z Autotune Choices Stats: 2025-12-04T09:15:00.9628275Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_730", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.9628720Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.9629049Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.9629621Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.9630792Z triton_flex_attention_backward_730 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.9631957Z triton_flex_attention_backward_729 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9633174Z triton_flex_attention_backward_731 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.9634367Z triton_flex_attention_backward_728 0.0152 ms 87.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9635557Z triton_flex_attention_backward_733 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9636705Z triton_flex_attention_backward_732 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9637867Z triton_flex_attention_backward_734 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.9639020Z triton_flex_attention_backward_735 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.9640179Z triton_flex_attention_backward_737 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9641341Z triton_flex_attention_backward_736 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9641620Z SingleProcess AUTOTUNE benchmarking takes 0.6686 seconds and 2.3200 seconds precompiling for 13 choices 2025-12-04T09:15:00.9641780Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.9641886Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.9641950Z unimplemented [] 2025-12-04T09:15:00.9642055Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.9642247Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.9643421Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.9643490Z graph_break [] 2025-12-04T09:15:00.9643619Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.9643742Z Autotune Choices Stats: 2025-12-04T09:15:00.9645132Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_745", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.9645385Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.9645610Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.9645928Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.9647065Z triton_flex_attention_745 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9648183Z triton_flex_attention_746 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9649305Z triton_flex_attention_743 0.0142 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.9650490Z triton_flex_attention_741 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9651637Z triton_flex_attention_744 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.9652864Z triton_flex_attention_742 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9653115Z SingleProcess AUTOTUNE benchmarking takes 0.2945 seconds and 1.2852 seconds precompiling for 6 choices 2025-12-04T09:15:00.9653191Z Autotune Choices Stats: 2025-12-04T09:15:00.9654620Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_748", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.9655061Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.9655386Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.9655960Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.9657133Z triton_flex_attention_backward_748 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9658329Z triton_flex_attention_backward_749 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.9659520Z triton_flex_attention_backward_747 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9660713Z triton_flex_attention_backward_750 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.9662089Z triton_flex_attention_backward_752 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9663267Z triton_flex_attention_backward_751 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9664428Z triton_flex_attention_backward_753 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.9665600Z triton_flex_attention_backward_754 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.9666760Z triton_flex_attention_backward_756 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9668016Z triton_flex_attention_backward_759 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.9668320Z SingleProcess AUTOTUNE benchmarking takes 0.6747 seconds and 2.3463 seconds precompiling for 13 choices 2025-12-04T09:15:00.9668454Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.9668535Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.9668600Z unimplemented [] 2025-12-04T09:15:00.9668703Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.9668894Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.9670107Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.9670181Z graph_break [] 2025-12-04T09:15:00.9670308Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.9670384Z Autotune Choices Stats: 2025-12-04T09:15:00.9671780Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_764", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.9672036Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.9672254Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.9672570Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.9673709Z triton_flex_attention_764 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9674827Z triton_flex_attention_765 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9676042Z triton_flex_attention_762 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.9677216Z triton_flex_attention_760 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9678370Z triton_flex_attention_763 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.9679495Z triton_flex_attention_761 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9679742Z SingleProcess AUTOTUNE benchmarking takes 0.2940 seconds and 1.3826 seconds precompiling for 6 choices 2025-12-04T09:15:00.9679815Z Autotune Choices Stats: 2025-12-04T09:15:00.9681241Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_769", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013376000337302685, "best_triton_pos": 0} 2025-12-04T09:15:00.9681680Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.9682006Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.9682578Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.9683748Z triton_flex_attention_backward_769 0.0134 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.9684984Z triton_flex_attention_backward_767 0.0143 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9686171Z triton_flex_attention_backward_768 0.0143 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.9687363Z triton_flex_attention_backward_766 0.0154 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9688518Z triton_flex_attention_backward_770 0.0154 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9689672Z triton_flex_attention_backward_771 0.0154 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9690828Z triton_flex_attention_backward_772 0.0154 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.9691991Z triton_flex_attention_backward_773 0.0154 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.9693308Z triton_flex_attention_backward_775 0.0164 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9694497Z triton_flex_attention_backward_778 0.0164 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.9694756Z SingleProcess AUTOTUNE benchmarking takes 0.6673 seconds and 2.4135 seconds precompiling for 13 choices 2025-12-04T09:15:00.9694886Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.9694969Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.9695031Z unimplemented [] 2025-12-04T09:15:00.9695134Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.9695361Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.9696544Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.9696613Z graph_break [] 2025-12-04T09:15:00.9696741Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.9696808Z Autotune Choices Stats: 2025-12-04T09:15:00.9698201Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_783", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.9698443Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.9698669Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.9698984Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.9700121Z triton_flex_attention_783 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9701268Z triton_flex_attention_784 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9702415Z triton_flex_attention_781 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.9703567Z triton_flex_attention_779 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9704712Z triton_flex_attention_782 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.9705833Z triton_flex_attention_780 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9706089Z SingleProcess AUTOTUNE benchmarking takes 0.2940 seconds and 1.3371 seconds precompiling for 6 choices 2025-12-04T09:15:00.9706164Z Autotune Choices Stats: 2025-12-04T09:15:00.9707592Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_787", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.9708035Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.9708362Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.9708969Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.9710195Z triton_flex_attention_backward_787 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.9711391Z triton_flex_attention_backward_785 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9712573Z triton_flex_attention_backward_786 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9713730Z triton_flex_attention_backward_788 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.9714884Z triton_flex_attention_backward_790 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9716040Z triton_flex_attention_backward_791 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.9717202Z triton_flex_attention_backward_789 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9718390Z triton_flex_attention_backward_792 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.9719579Z triton_flex_attention_backward_794 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9720763Z triton_flex_attention_backward_797 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.9721049Z SingleProcess AUTOTUNE benchmarking takes 0.6666 seconds and 2.3976 seconds precompiling for 13 choices 2025-12-04T09:15:00.9721180Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.9721257Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.9721320Z unimplemented [] 2025-12-04T09:15:00.9721421Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.9721615Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.9722793Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.9722864Z graph_break [] 2025-12-04T09:15:00.9722996Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.9723063Z Autotune Choices Stats: 2025-12-04T09:15:00.9724457Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_803", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.9724704Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.9724932Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.9725248Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.9726383Z triton_flex_attention_803 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9727568Z triton_flex_attention_802 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9728720Z triton_flex_attention_801 0.0143 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.9729879Z triton_flex_attention_798 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9730995Z triton_flex_attention_800 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.9732133Z triton_flex_attention_799 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9732428Z SingleProcess AUTOTUNE benchmarking takes 0.2967 seconds and 1.3426 seconds precompiling for 6 choices 2025-12-04T09:15:00.9732506Z Autotune Choices Stats: 2025-12-04T09:15:00.9733946Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_806", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.9734389Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.9734754Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.9735355Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.9736561Z triton_flex_attention_backward_806 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.9737751Z triton_flex_attention_backward_807 0.0142 ms 93.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.9738906Z triton_flex_attention_backward_804 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9740079Z triton_flex_attention_backward_805 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9741234Z triton_flex_attention_backward_809 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9742393Z triton_flex_attention_backward_810 0.0163 ms 81.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.9743553Z triton_flex_attention_backward_813 0.0163 ms 81.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9744786Z triton_flex_attention_backward_808 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9745993Z triton_flex_attention_backward_811 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.9747178Z triton_flex_attention_backward_816 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.9747430Z SingleProcess AUTOTUNE benchmarking takes 0.6691 seconds and 2.4016 seconds precompiling for 13 choices 2025-12-04T09:15:00.9747559Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.9747634Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.9747695Z unimplemented [] 2025-12-04T09:15:00.9747796Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.9747986Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.9749163Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.9749240Z graph_break [] 2025-12-04T09:15:00.9749368Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.9749438Z Autotune Choices Stats: 2025-12-04T09:15:00.9750835Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_821", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.9751078Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.9751351Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.9751669Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.9752835Z triton_flex_attention_821 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9753982Z triton_flex_attention_822 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9755138Z triton_flex_attention_819 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.9756253Z triton_flex_attention_820 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.9757370Z triton_flex_attention_817 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9758486Z triton_flex_attention_818 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9758730Z SingleProcess AUTOTUNE benchmarking takes 0.2943 seconds and 1.3304 seconds precompiling for 6 choices 2025-12-04T09:15:00.9758801Z Autotune Choices Stats: 2025-12-04T09:15:00.9760229Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_826", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.9760704Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.9761104Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.9761698Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.9763032Z triton_flex_attention_backward_826 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.9764267Z triton_flex_attention_backward_823 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9765433Z triton_flex_attention_backward_824 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9766592Z triton_flex_attention_backward_825 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.9767741Z triton_flex_attention_backward_828 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9768903Z triton_flex_attention_backward_829 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.9770150Z triton_flex_attention_backward_830 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.9771345Z triton_flex_attention_backward_827 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9772604Z triton_flex_attention_backward_835 0.0173 ms 76.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.9773768Z triton_flex_attention_backward_831 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9774023Z SingleProcess AUTOTUNE benchmarking takes 0.6677 seconds and 2.3188 seconds precompiling for 13 choices 2025-12-04T09:15:00.9774152Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.9774229Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.9774296Z unimplemented [] 2025-12-04T09:15:00.9774399Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.9774590Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.9775761Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.9775829Z graph_break [] 2025-12-04T09:15:00.9775968Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.9776037Z Autotune Choices Stats: 2025-12-04T09:15:00.9777430Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_840", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.9777714Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.9777938Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.9778289Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.9779449Z triton_flex_attention_840 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9780601Z triton_flex_attention_841 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9781727Z triton_flex_attention_838 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.9782840Z triton_flex_attention_836 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9783959Z triton_flex_attention_839 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.9785080Z triton_flex_attention_837 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9785325Z SingleProcess AUTOTUNE benchmarking takes 0.2942 seconds and 1.3626 seconds precompiling for 6 choices 2025-12-04T09:15:00.9785392Z Autotune Choices Stats: 2025-12-04T09:15:00.9786887Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_844", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.9787357Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.9787680Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.9788248Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.9789452Z triton_flex_attention_backward_844 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.9790622Z triton_flex_attention_backward_845 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.9791775Z triton_flex_attention_backward_842 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9792932Z triton_flex_attention_backward_843 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9794084Z triton_flex_attention_backward_846 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9795277Z triton_flex_attention_backward_847 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9796467Z triton_flex_attention_backward_848 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.9797645Z triton_flex_attention_backward_851 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9798836Z triton_flex_attention_backward_849 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.9799995Z triton_flex_attention_backward_854 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.9800248Z SingleProcess AUTOTUNE benchmarking takes 0.6694 seconds and 2.3460 seconds precompiling for 13 choices 2025-12-04T09:15:00.9800380Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.9800456Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.9800523Z unimplemented [] 2025-12-04T09:15:00.9800626Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.9800825Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.9802021Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.9802097Z graph_break [] 2025-12-04T09:15:00.9802226Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.9802297Z Autotune Choices Stats: 2025-12-04T09:15:00.9803778Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_859", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T09:15:00.9804072Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.9804300Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.9804614Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.9805753Z triton_flex_attention_859 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9806898Z triton_flex_attention_860 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9808023Z triton_flex_attention_855 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9809139Z triton_flex_attention_857 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.9810268Z triton_flex_attention_858 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.9811383Z triton_flex_attention_856 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9811672Z SingleProcess AUTOTUNE benchmarking takes 0.2942 seconds and 1.3732 seconds precompiling for 6 choices 2025-12-04T09:15:00.9811739Z Autotune Choices Stats: 2025-12-04T09:15:00.9813262Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_862", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.9813727Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.9814061Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.9814656Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.9815832Z triton_flex_attention_backward_862 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9817000Z triton_flex_attention_backward_864 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.9818164Z triton_flex_attention_backward_861 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9819324Z triton_flex_attention_backward_863 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.9820490Z triton_flex_attention_backward_866 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9821712Z triton_flex_attention_backward_865 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9823079Z triton_flex_attention_backward_867 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.9824266Z triton_flex_attention_backward_868 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.9825419Z triton_flex_attention_backward_870 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9826571Z triton_flex_attention_backward_873 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.9826825Z SingleProcess AUTOTUNE benchmarking takes 0.6688 seconds and 2.4237 seconds precompiling for 13 choices 2025-12-04T09:15:00.9826956Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.9827031Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.9827108Z unimplemented [] 2025-12-04T09:15:00.9827214Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.9827407Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.9828581Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.9828682Z graph_break [] 2025-12-04T09:15:00.9828809Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.9828876Z Autotune Choices Stats: 2025-12-04T09:15:00.9830309Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_878", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.9830583Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.9830804Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.9831120Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.9832292Z triton_flex_attention_878 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9833405Z triton_flex_attention_879 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9834529Z triton_flex_attention_874 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9835652Z triton_flex_attention_876 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.9836773Z triton_flex_attention_877 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.9837936Z triton_flex_attention_875 0.0145 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9838235Z SingleProcess AUTOTUNE benchmarking takes 0.2942 seconds and 1.4549 seconds precompiling for 6 choices 2025-12-04T09:15:00.9838334Z Autotune Choices Stats: 2025-12-04T09:15:00.9839780Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_881", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.9840215Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.9840583Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.9841148Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.9842327Z triton_flex_attention_backward_881 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9843492Z triton_flex_attention_backward_880 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9844648Z triton_flex_attention_backward_882 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.9845811Z triton_flex_attention_backward_883 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.9847052Z triton_flex_attention_backward_885 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9848258Z triton_flex_attention_backward_884 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9849442Z triton_flex_attention_backward_886 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.9850606Z triton_flex_attention_backward_887 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.9851766Z triton_flex_attention_backward_889 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9853053Z triton_flex_attention_backward_892 0.0172 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.9853308Z SingleProcess AUTOTUNE benchmarking takes 0.6678 seconds and 2.4344 seconds precompiling for 13 choices 2025-12-04T09:15:00.9853443Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.9853514Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.9853586Z unimplemented [] 2025-12-04T09:15:00.9853690Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.9853880Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.9855100Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.9855169Z graph_break [] 2025-12-04T09:15:00.9855331Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.9855435Z Autotune Choices Stats: 2025-12-04T09:15:00.9856835Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_897", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.9857079Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.9857339Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.9857658Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.9858801Z triton_flex_attention_897 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9859922Z triton_flex_attention_898 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9861046Z triton_flex_attention_893 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9862338Z triton_flex_attention_895 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.9863541Z triton_flex_attention_896 0.0142 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.9864782Z triton_flex_attention_894 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9865084Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.3455 seconds precompiling for 6 choices 2025-12-04T09:15:00.9865153Z Autotune Choices Stats: 2025-12-04T09:15:00.9866635Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_899", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T09:15:00.9867072Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.9867407Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.9867969Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.9869141Z triton_flex_attention_backward_899 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9870301Z triton_flex_attention_backward_900 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9871465Z triton_flex_attention_backward_901 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.9872715Z triton_flex_attention_backward_902 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.9873900Z triton_flex_attention_backward_904 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9875089Z triton_flex_attention_backward_903 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9876241Z triton_flex_attention_backward_905 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.9877402Z triton_flex_attention_backward_906 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.9878567Z triton_flex_attention_backward_907 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9879722Z triton_flex_attention_backward_908 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9879978Z SingleProcess AUTOTUNE benchmarking takes 0.6671 seconds and 2.2860 seconds precompiling for 13 choices 2025-12-04T09:15:00.9880147Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.9880220Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.9880288Z unimplemented [] 2025-12-04T09:15:00.9880392Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.9880586Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.9881793Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.9881893Z graph_break [] 2025-12-04T09:15:00.9882022Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.9882091Z Autotune Choices Stats: 2025-12-04T09:15:00.9883523Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_916", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T09:15:00.9883770Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.9883996Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.9884313Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.9885533Z triton_flex_attention_916 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9886663Z triton_flex_attention_917 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9887798Z triton_flex_attention_915 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.9888908Z triton_flex_attention_912 0.0143 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9890105Z triton_flex_attention_914 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.9891265Z triton_flex_attention_913 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9891525Z SingleProcess AUTOTUNE benchmarking takes 0.2932 seconds and 1.3622 seconds precompiling for 6 choices 2025-12-04T09:15:00.9891592Z Autotune Choices Stats: 2025-12-04T09:15:00.9893135Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_918", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:15:00.9893574Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.9893908Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.9894486Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.9895648Z triton_flex_attention_backward_918 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9896812Z triton_flex_attention_backward_919 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9898055Z triton_flex_attention_backward_920 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.9899249Z triton_flex_attention_backward_921 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.9900431Z triton_flex_attention_backward_923 0.0152 ms 94.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9901655Z triton_flex_attention_backward_922 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9902815Z triton_flex_attention_backward_924 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.9903974Z triton_flex_attention_backward_925 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.9905139Z triton_flex_attention_backward_926 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9906296Z triton_flex_attention_backward_927 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9906587Z SingleProcess AUTOTUNE benchmarking takes 0.6666 seconds and 2.2888 seconds precompiling for 13 choices 2025-12-04T09:15:00.9906717Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:15:00.9906786Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:15:00.9906855Z unimplemented [] 2025-12-04T09:15:00.9906998Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:15:00.9907228Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:15:00.9908401Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:15:00.9908464Z graph_break [] 2025-12-04T09:15:00.9908598Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:15:00.9908666Z Autotune Choices Stats: 2025-12-04T09:15:00.9910101Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_935", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01033599954098463, "best_triton_pos": 0} 2025-12-04T09:15:00.9910350Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.9910574Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.9910889Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.9912030Z triton_flex_attention_935 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9913145Z triton_flex_attention_936 0.0111 ms 92.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9914270Z triton_flex_attention_931 0.0133 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9915459Z triton_flex_attention_933 0.0133 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T09:15:00.9916613Z triton_flex_attention_934 0.0143 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T09:15:00.9917760Z triton_flex_attention_932 0.0164 ms 63.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9918015Z SingleProcess AUTOTUNE benchmarking takes 0.2941 seconds and 1.3368 seconds precompiling for 6 choices 2025-12-04T09:15:00.9918084Z Autotune Choices Stats: 2025-12-04T09:15:00.9919523Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_937", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T09:15:00.9919958Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:15:00.9920293Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:15:00.9920860Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:15:00.9922040Z triton_flex_attention_backward_937 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9923202Z triton_flex_attention_backward_938 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9924431Z triton_flex_attention_backward_939 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.9925614Z triton_flex_attention_backward_940 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.9926795Z triton_flex_attention_backward_942 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9927952Z triton_flex_attention_backward_943 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T09:15:00.9929105Z triton_flex_attention_backward_941 0.0154 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T09:15:00.9930262Z triton_flex_attention_backward_944 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T09:15:00.9931421Z triton_flex_attention_backward_946 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T09:15:00.9932700Z triton_flex_attention_backward_949 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T09:15:00.9933005Z SingleProcess AUTOTUNE benchmarking takes 0.6678 seconds and 2.3482 seconds precompiling for 13 choices 2025-12-04T09:15:00.9933462Z - generated xml file: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_flex_attention/inductor.test_flex_attention-8a3fc0e970c9a98f.xml - 2025-12-04T09:15:00.9933569Z =========================== short test summary info ============================ 2025-12-04T09:15:00.9934173Z FAILED [8.6616s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpmh6p_qgg/flex_attention_configs.json was not created 2025-12-04T09:15:00.9934179Z 2025-12-04T09:15:00.9934314Z To execute this test, run the following from the base repo dir: 2025-12-04T09:15:00.9934571Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:15:00.9934574Z 2025-12-04T09:15:00.9934742Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:15:00.9935334Z FAILED [8.4222s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmp804t851x/flex_attention_configs.json was not created 2025-12-04T09:15:00.9935338Z 2025-12-04T09:15:00.9935467Z To execute this test, run the following from the base repo dir: 2025-12-04T09:15:00.9935717Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:15:00.9935721Z 2025-12-04T09:15:00.9935887Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:15:00.9936435Z FAILED [8.2706s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpz5j3y52j/flex_attention_configs.json was not created 2025-12-04T09:15:00.9936440Z 2025-12-04T09:15:00.9936576Z To execute this test, run the following from the base repo dir: 2025-12-04T09:15:00.9936827Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:15:00.9936831Z 2025-12-04T09:15:00.9936986Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:15:00.9937523Z FAILED [8.5769s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpe9s264og/flex_attention_configs.json was not created 2025-12-04T09:15:00.9937528Z 2025-12-04T09:15:00.9937649Z To execute this test, run the following from the base repo dir: 2025-12-04T09:15:00.9937898Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:15:00.9937901Z 2025-12-04T09:15:00.9938052Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:15:00.9938594Z FAILED [8.5634s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpdl7gvigl/flex_attention_configs.json was not created 2025-12-04T09:15:00.9938599Z 2025-12-04T09:15:00.9938720Z To execute this test, run the following from the base repo dir: 2025-12-04T09:15:00.9938965Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:15:00.9939010Z 2025-12-04T09:15:00.9939172Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:15:00.9939705Z FAILED [8.7935s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpmmrfilt6/flex_attention_configs.json was not created 2025-12-04T09:15:00.9939711Z 2025-12-04T09:15:00.9939838Z To execute this test, run the following from the base repo dir: 2025-12-04T09:15:00.9940139Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:15:00.9940175Z 2025-12-04T09:15:00.9940343Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:15:00.9941048Z FAILED [8.5336s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpo4q1ub_j/flex_attention_configs.json was not created 2025-12-04T09:15:00.9941054Z 2025-12-04T09:15:00.9941184Z To execute this test, run the following from the base repo dir: 2025-12-04T09:15:00.9941448Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:15:00.9941451Z 2025-12-04T09:15:00.9941610Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:15:00.9942159Z FAILED [8.4584s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmp14xpub2o/flex_attention_configs.json was not created 2025-12-04T09:15:00.9942164Z 2025-12-04T09:15:00.9942326Z To execute this test, run the following from the base repo dir: 2025-12-04T09:15:00.9942572Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:15:00.9942582Z 2025-12-04T09:15:00.9942733Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:15:00.9943265Z FAILED [8.8574s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpy18z1z5_/flex_attention_configs.json was not created 2025-12-04T09:15:00.9943269Z 2025-12-04T09:15:00.9943396Z To execute this test, run the following from the base repo dir: 2025-12-04T09:15:00.9943639Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:15:00.9943642Z 2025-12-04T09:15:00.9943800Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:15:00.9944334Z FAILED [8.6267s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpc046nj78/flex_attention_configs.json was not created 2025-12-04T09:15:00.9944338Z 2025-12-04T09:15:00.9944459Z To execute this test, run the following from the base repo dir: 2025-12-04T09:15:00.9944710Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:15:00.9944713Z 2025-12-04T09:15:00.9944862Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:15:00.9945399Z FAILED [8.9885s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpv3f897i8/flex_attention_configs.json was not created 2025-12-04T09:15:00.9945403Z 2025-12-04T09:15:00.9945532Z To execute this test, run the following from the base repo dir: 2025-12-04T09:15:00.9945784Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:15:00.9945794Z 2025-12-04T09:15:00.9945948Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:15:00.9946476Z FAILED [8.7940s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpm745bxoj/flex_attention_configs.json was not created 2025-12-04T09:15:00.9946517Z 2025-12-04T09:15:00.9946647Z To execute this test, run the following from the base repo dir: 2025-12-04T09:15:00.9946890Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:15:00.9946894Z 2025-12-04T09:15:00.9947053Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:15:00.9947616Z FAILED [8.6331s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpz_msg098/flex_attention_configs.json was not created 2025-12-04T09:15:00.9947651Z 2025-12-04T09:15:00.9947773Z To execute this test, run the following from the base repo dir: 2025-12-04T09:15:00.9948027Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:15:00.9948032Z 2025-12-04T09:15:00.9948184Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:15:00.9948729Z FAILED [8.7056s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpede0d40r/flex_attention_configs.json was not created 2025-12-04T09:15:00.9948733Z 2025-12-04T09:15:00.9948855Z To execute this test, run the following from the base repo dir: 2025-12-04T09:15:00.9949099Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:15:00.9949117Z 2025-12-04T09:15:00.9949313Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:15:00.9949846Z FAILED [8.6225s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpqpok_9h7/flex_attention_configs.json was not created 2025-12-04T09:15:00.9949852Z 2025-12-04T09:15:00.9949980Z To execute this test, run the following from the base repo dir: 2025-12-04T09:15:00.9950222Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:15:00.9950225Z 2025-12-04T09:15:00.9950382Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:15:00.9950912Z FAILED [9.0560s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmp_l7fygn8/flex_attention_configs.json was not created 2025-12-04T09:15:00.9950918Z 2025-12-04T09:15:00.9951042Z To execute this test, run the following from the base repo dir: 2025-12-04T09:15:00.9951292Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:15:00.9951295Z 2025-12-04T09:15:00.9951445Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:15:00.9951986Z FAILED [8.6752s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpfec1pktd/flex_attention_configs.json was not created 2025-12-04T09:15:00.9951989Z 2025-12-04T09:15:00.9952108Z To execute this test, run the following from the base repo dir: 2025-12-04T09:15:00.9952361Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:15:00.9952364Z 2025-12-04T09:15:00.9952515Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:15:00.9953050Z FAILED [8.6261s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmp_por9z0t/flex_attention_configs.json was not created 2025-12-04T09:15:00.9953055Z 2025-12-04T09:15:00.9953184Z To execute this test, run the following from the base repo dir: 2025-12-04T09:15:00.9953428Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:15:00.9953470Z 2025-12-04T09:15:00.9953630Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:15:00.9954163Z FAILED [9.3884s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmp2gny9fyn/flex_attention_configs.json was not created 2025-12-04T09:15:00.9954167Z 2025-12-04T09:15:00.9954289Z To execute this test, run the following from the base repo dir: 2025-12-04T09:15:00.9954576Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:15:00.9954610Z 2025-12-04T09:15:00.9954764Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:15:00.9955303Z FAILED [8.6865s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpa8611ro_/flex_attention_configs.json was not created 2025-12-04T09:15:00.9955309Z 2025-12-04T09:15:00.9955428Z To execute this test, run the following from the base repo dir: 2025-12-04T09:15:00.9955677Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:15:00.9955680Z 2025-12-04T09:15:00.9955830Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:15:00.9956402Z FAILED [8.9005s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpw59c6t4n/flex_attention_configs.json was not created 2025-12-04T09:15:00.9956407Z 2025-12-04T09:15:00.9956535Z To execute this test, run the following from the base repo dir: 2025-12-04T09:15:00.9956778Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:15:00.9956781Z 2025-12-04T09:15:00.9956945Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:15:00.9957487Z FAILED [8.9845s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmp92s2b8yb/flex_attention_configs.json was not created 2025-12-04T09:15:00.9957492Z 2025-12-04T09:15:00.9957611Z To execute this test, run the following from the base repo dir: 2025-12-04T09:15:00.9957866Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:15:00.9957869Z 2025-12-04T09:15:00.9958022Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:15:00.9958565Z FAILED [9.3737s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmp2f1po3gr/flex_attention_configs.json was not created 2025-12-04T09:15:00.9958568Z 2025-12-04T09:15:00.9958693Z To execute this test, run the following from the base repo dir: 2025-12-04T09:15:00.9958944Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:15:00.9958947Z 2025-12-04T09:15:00.9959096Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:15:00.9959632Z FAILED [8.9413s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmp0r9jrtv3/flex_attention_configs.json was not created 2025-12-04T09:15:00.9959636Z 2025-12-04T09:15:00.9959764Z To execute this test, run the following from the base repo dir: 2025-12-04T09:15:00.9960011Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:15:00.9960014Z 2025-12-04T09:15:00.9960168Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:15:00.9960704Z FAILED [8.7159s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpwwy855w8/flex_attention_configs.json was not created 2025-12-04T09:15:00.9960747Z 2025-12-04T09:15:00.9960878Z To execute this test, run the following from the base repo dir: 2025-12-04T09:15:00.9961123Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:15:00.9961126Z 2025-12-04T09:15:00.9961276Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:15:00.9962066Z FAILED [8.7410s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpne01nos0/flex_attention_configs.json was not created 2025-12-04T09:15:00.9962117Z 2025-12-04T09:15:00.9962263Z To execute this test, run the following from the base repo dir: 2025-12-04T09:15:00.9962531Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:15:00.9962536Z 2025-12-04T09:15:00.9962695Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:15:00.9963243Z FAILED [8.8815s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpkvey3fd3/flex_attention_configs.json was not created 2025-12-04T09:15:00.9963253Z 2025-12-04T09:15:00.9963381Z To execute this test, run the following from the base repo dir: 2025-12-04T09:15:00.9963682Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:15:00.9963687Z 2025-12-04T09:15:00.9963847Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:15:00.9964386Z FAILED [8.7821s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpsch_rjii/flex_attention_configs.json was not created 2025-12-04T09:15:00.9964391Z 2025-12-04T09:15:00.9964520Z To execute this test, run the following from the base repo dir: 2025-12-04T09:15:00.9964768Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:15:00.9964772Z 2025-12-04T09:15:00.9964922Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:15:00.9965468Z FAILED [8.6779s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmp7i52oz8u/flex_attention_configs.json was not created 2025-12-04T09:15:00.9965475Z 2025-12-04T09:15:00.9965596Z To execute this test, run the following from the base repo dir: 2025-12-04T09:15:00.9965848Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:15:00.9965851Z 2025-12-04T09:15:00.9966001Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:15:00.9966535Z FAILED [8.7016s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmp3f354stf/flex_attention_configs.json was not created 2025-12-04T09:15:00.9966549Z 2025-12-04T09:15:00.9966679Z To execute this test, run the following from the base repo dir: 2025-12-04T09:15:00.9966925Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:15:00.9966928Z 2025-12-04T09:15:00.9967086Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:15:00.9967621Z FAILED [8.9423s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpxa4mrgd0/flex_attention_configs.json was not created 2025-12-04T09:15:00.9967625Z 2025-12-04T09:15:00.9967751Z To execute this test, run the following from the base repo dir: 2025-12-04T09:15:00.9968050Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:15:00.9968054Z 2025-12-04T09:15:00.9968206Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:15:00.9968747Z FAILED [8.9614s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmptpkybqre/flex_attention_configs.json was not created 2025-12-04T09:15:00.9968751Z 2025-12-04T09:15:00.9968908Z To execute this test, run the following from the base repo dir: 2025-12-04T09:15:00.9969191Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:15:00.9969194Z 2025-12-04T09:15:00.9969348Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:15:00.9969878Z FAILED [8.6491s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmp_4pivx94/flex_attention_configs.json was not created 2025-12-04T09:15:00.9969890Z 2025-12-04T09:15:00.9970011Z To execute this test, run the following from the base repo dir: 2025-12-04T09:15:00.9970258Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:15:00.9970261Z 2025-12-04T09:15:00.9970419Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:15:00.9970993Z FAILED [8.9966s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmp5v2t29we/flex_attention_configs.json was not created 2025-12-04T09:15:00.9970999Z 2025-12-04T09:15:00.9971132Z To execute this test, run the following from the base repo dir: 2025-12-04T09:15:00.9971382Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:15:00.9971387Z 2025-12-04T09:15:00.9971540Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:15:00.9972084Z FAILED [9.0881s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpg4bk3jz1/flex_attention_configs.json was not created 2025-12-04T09:15:00.9972088Z 2025-12-04T09:15:00.9972210Z To execute this test, run the following from the base repo dir: 2025-12-04T09:15:00.9972526Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:15:00.9972533Z 2025-12-04T09:15:00.9972698Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:15:00.9973238Z FAILED [8.8634s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpvh1jtdth/flex_attention_configs.json was not created 2025-12-04T09:15:00.9973249Z 2025-12-04T09:15:00.9973372Z To execute this test, run the following from the base repo dir: 2025-12-04T09:15:00.9973616Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:15:00.9973620Z 2025-12-04T09:15:00.9973777Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:15:00.9974316Z FAILED [9.0042s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpjvws9kal/flex_attention_configs.json was not created 2025-12-04T09:15:00.9974321Z 2025-12-04T09:15:00.9974452Z To execute this test, run the following from the base repo dir: 2025-12-04T09:15:00.9974701Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:15:00.9974704Z 2025-12-04T09:15:00.9974856Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:15:00.9975436Z FAILED [9.0146s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpmpfvm_pd/flex_attention_configs.json was not created 2025-12-04T09:15:00.9975440Z 2025-12-04T09:15:00.9975565Z To execute this test, run the following from the base repo dir: 2025-12-04T09:15:00.9975820Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:15:00.9975823Z 2025-12-04T09:15:00.9976007Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:15:00.9976576Z FAILED [9.0749s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmp06g1qqtu/flex_attention_configs.json was not created 2025-12-04T09:15:00.9976585Z 2025-12-04T09:15:00.9976708Z To execute this test, run the following from the base repo dir: 2025-12-04T09:15:00.9976957Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:15:00.9976960Z 2025-12-04T09:15:00.9977116Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:15:00.9977656Z FAILED [9.3859s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpup5bjcux/flex_attention_configs.json was not created 2025-12-04T09:15:00.9977659Z 2025-12-04T09:15:00.9977790Z To execute this test, run the following from the base repo dir: 2025-12-04T09:15:00.9978076Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:15:00.9978080Z 2025-12-04T09:15:00.9978232Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:15:00.9978776Z FAILED [8.9844s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmphg47bmc3/flex_attention_configs.json was not created 2025-12-04T09:15:00.9978782Z 2025-12-04T09:15:00.9978907Z To execute this test, run the following from the base repo dir: 2025-12-04T09:15:00.9979158Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:15:00.9979161Z 2025-12-04T09:15:00.9979315Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:15:00.9979863Z FAILED [8.8614s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpdl9kigxh/flex_attention_configs.json was not created 2025-12-04T09:15:00.9979868Z 2025-12-04T09:15:00.9979989Z To execute this test, run the following from the base repo dir: 2025-12-04T09:15:00.9980235Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:15:00.9980240Z 2025-12-04T09:15:00.9980399Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:15:00.9980937Z FAILED [8.9624s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpjn0myk4t/flex_attention_configs.json was not created 2025-12-04T09:15:00.9980940Z 2025-12-04T09:15:00.9981070Z To execute this test, run the following from the base repo dir: 2025-12-04T09:15:00.9981318Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:15:00.9981323Z 2025-12-04T09:15:00.9981476Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:15:00.9982018Z FAILED [9.1600s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpdv3ka747/flex_attention_configs.json was not created 2025-12-04T09:15:00.9982061Z 2025-12-04T09:15:00.9982187Z To execute this test, run the following from the base repo dir: 2025-12-04T09:15:00.9982438Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:15:00.9982442Z 2025-12-04T09:15:00.9982592Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:15:00.9983141Z FAILED [9.0727s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmptn9lux0k/flex_attention_configs.json was not created 2025-12-04T09:15:00.9983210Z 2025-12-04T09:15:00.9983336Z To execute this test, run the following from the base repo dir: 2025-12-04T09:15:00.9983587Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:15:00.9983590Z 2025-12-04T09:15:00.9983748Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:15:00.9984284Z FAILED [9.1677s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmp8f2jgdv8/flex_attention_configs.json was not created 2025-12-04T09:15:00.9984289Z 2025-12-04T09:15:00.9984413Z To execute this test, run the following from the base repo dir: 2025-12-04T09:15:00.9984663Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:15:00.9984666Z 2025-12-04T09:15:00.9984827Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:15:00.9985404Z FAILED [8.6989s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmp2i0vhuu1/flex_attention_configs.json was not created 2025-12-04T09:15:00.9985408Z 2025-12-04T09:15:00.9985535Z To execute this test, run the following from the base repo dir: 2025-12-04T09:15:00.9985789Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:15:00.9985792Z 2025-12-04T09:15:00.9985947Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:15:00.9986484Z FAILED [8.8898s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmp8zz3dbsv/flex_attention_configs.json was not created 2025-12-04T09:15:00.9986488Z 2025-12-04T09:15:00.9986609Z To execute this test, run the following from the base repo dir: 2025-12-04T09:15:00.9986858Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:15:00.9986862Z 2025-12-04T09:15:00.9987018Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:15:00.9987550Z FAILED [8.8961s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpa4rfsguv/flex_attention_configs.json was not created 2025-12-04T09:15:00.9987554Z 2025-12-04T09:15:00.9987687Z To execute this test, run the following from the base repo dir: 2025-12-04T09:15:00.9987937Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:15:00.9987940Z 2025-12-04T09:15:00.9988096Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:15:00.9988213Z =================== 49 failed, 1 passed in 445.36s (0:07:25) =================== 2025-12-04T09:15:00.9988217Z 2025-12-04T09:15:00.9988564Z FINISHED PRINTING LOG FILE of inductor/test_flex_attention 1/6 (test/test-reports/inductor.test_flex_attention_1.6_88265ee86f308f18_.log) 2025-12-04T09:15:00.9988567Z 2025-12-04T09:15:00.9988784Z Finished inductor/test_flex_attention 1/6 ... [2025-12-04 09:14:57.215914][1340.879721231], took 7.57min 2025-12-04T09:15:00.9989271Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_flex_attention/inductor.test_flex_attention-8a3fc0e970c9a98f.xml 2025-12-04T09:15:00.9989413Z Uploading logs for 57116084853 to S3 2025-12-04T09:15:00.9989494Z Uploading artifacts took 1.32 seconds 2025-12-04T09:15:00.9989576Z inductor/test_flex_attention 1/6 failed! 2025-12-04T09:15:00.9989763Z Running inductor/test_flex_attention 2/6 ... [2025-12-04 09:14:58.919333][1342.583149025] 2025-12-04T09:15:00.9989842Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:15:00.9990596Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_flex_attention.py', '--shard-id=2', '--num-shards=6', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:14:58.919705] 2025-12-04T09:15:05.5103399Z 2025-12-04T09:15:05.5104545Z inductor/test_flex_attention 2/6 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_flex_attention_2.6_4c127f8c0344b8ae_.log 2025-12-04T09:15:05.5105415Z Running 0 items in this shard: 2025-12-04T09:15:05.5105608Z 2025-12-04T09:15:05.5105914Z Finished inductor/test_flex_attention 2/6 ... [2025-12-04 09:15:05.510195][1349.174013654], took 0.11min 2025-12-04T09:15:05.5114212Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_flex_attention/inductor.test_flex_attention-51eef588a0930662.xml 2025-12-04T09:15:05.5828015Z Running inductor/test_flex_attention 3/6 ... [2025-12-04 09:15:05.582542][1349.246362481] 2025-12-04T09:15:05.5828629Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:15:05.5831530Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_flex_attention.py', '--shard-id=3', '--num-shards=6', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:15:05.582830] 2025-12-04T09:15:12.1200996Z 2025-12-04T09:15:12.1201910Z inductor/test_flex_attention 3/6 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_flex_attention_3.6_df984f4b37c4889f_.log 2025-12-04T09:15:12.1202668Z Running 0 items in this shard: 2025-12-04T09:15:12.1202843Z 2025-12-04T09:15:12.1203098Z Finished inductor/test_flex_attention 3/6 ... [2025-12-04 09:15:12.119843][1355.783662229], took 0.11min 2025-12-04T09:15:12.1212183Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_flex_attention/inductor.test_flex_attention-af36cb9c6428bfea.xml 2025-12-04T09:15:12.1968978Z Running inductor/test_flex_attention 4/6 ... [2025-12-04 09:15:12.196632][1355.860452074] 2025-12-04T09:15:12.1969458Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:15:12.1972018Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_flex_attention.py', '--shard-id=4', '--num-shards=6', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:15:12.196910] 2025-12-04T09:15:18.7104493Z 2025-12-04T09:15:18.7105425Z inductor/test_flex_attention 4/6 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_flex_attention_4.6_453bb167963be69f_.log 2025-12-04T09:15:18.7106256Z Running 0 items in this shard: 2025-12-04T09:15:18.7106438Z 2025-12-04T09:15:18.7106768Z Finished inductor/test_flex_attention 4/6 ... [2025-12-04 09:15:18.710158][1362.373975767], took 0.11min 2025-12-04T09:15:18.7117680Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_flex_attention/inductor.test_flex_attention-37c6f4c91cf1b10c.xml 2025-12-04T09:15:18.7901310Z Running inductor/test_flex_attention 5/6 ... [2025-12-04 09:15:18.789870][1362.453688622] 2025-12-04T09:15:18.7901829Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:15:18.7904853Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_flex_attention.py', '--shard-id=5', '--num-shards=6', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:15:18.790187] 2025-12-04T09:15:25.2945284Z 2025-12-04T09:15:25.2946286Z inductor/test_flex_attention 5/6 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_flex_attention_5.6_691743a62e38d3bc_.log 2025-12-04T09:15:25.2947328Z Running 0 items in this shard: 2025-12-04T09:15:25.2947616Z 2025-12-04T09:15:25.2947913Z Finished inductor/test_flex_attention 5/6 ... [2025-12-04 09:15:25.294288][1368.958104887], took 0.11min 2025-12-04T09:15:25.2960411Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_flex_attention/inductor.test_flex_attention-0f855b036b683cf8.xml 2025-12-04T09:15:25.3763896Z Running inductor/test_flex_attention 6/6 ... [2025-12-04 09:15:25.376085][1369.039904249] 2025-12-04T09:15:25.3764452Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:15:25.3766937Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_flex_attention.py', '--shard-id=6', '--num-shards=6', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:15:25.376412] 2025-12-04T09:17:01.1869862Z 2025-12-04T09:17:01.1871667Z inductor/test_flex_attention 6/6 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_flex_attention_6.6_878f54c09ed3a0a3_.log 2025-12-04T09:17:01.1914914Z Running 100 items in this shard: test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16 2025-12-04T09:17:01.1951120Z 2025-12-04T09:17:01.1951369Z Finished inductor/test_flex_attention 6/6 ... [2025-12-04 09:17:01.186991][1464.850805536], took 1.60min 2025-12-04T09:17:01.1952190Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_flex_attention/inductor.test_flex_attention-fba11c84b84fb152.xml 2025-12-04T09:17:01.2773710Z Running dynamo/test_fake_distributed 1/1 ... [2025-12-04 09:17:01.277089][1464.940907308] 2025-12-04T09:17:01.2774383Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:17:01.2777831Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'dynamo/test_fake_distributed.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:17:01.277440] 2025-12-04T09:17:04.3537083Z 2025-12-04T09:17:04.3538152Z dynamo/test_fake_distributed 1/1 was successful, full logs can be found in artifacts with path test/test-reports/dynamo.test_fake_distributed_1.1_2823f9f71dd63ef0_.log 2025-12-04T09:17:04.3539109Z Running 0 items in this shard: 2025-12-04T09:17:04.3539317Z 2025-12-04T09:17:04.3539702Z Finished dynamo/test_fake_distributed 1/1 ... [2025-12-04 09:17:04.353451][1468.017267652], took 0.05min 2025-12-04T09:17:04.3553821Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/dynamo.test_fake_distributed/dynamo.test_fake_distributed-23dd0c70b083b9a3.xml 2025-12-04T09:17:04.3833656Z Running test_ci_sanity_check_fail 1/1 ... [2025-12-04 09:17:04.383123][1468.046941448] 2025-12-04T09:17:04.3834424Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:17:04.3837445Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_ci_sanity_check_fail.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:17:04.383453] 2025-12-04T09:17:07.1749304Z Finished test_ci_sanity_check_fail 1/1 ... [2025-12-04 09:17:07.174513][1470.838320285], took 0.05min 2025-12-04T09:17:07.1769090Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_ci_sanity_check_fail/test_ci_sanity_check_fail-05128a5df70d9533.xml 2025-12-04T09:17:08.5659175Z Uploading logs for 57116084853 to S3 2025-12-04T09:17:08.7441674Z Uploading artifacts took 1.54 seconds 2025-12-04T09:17:08.7442339Z test_ci_sanity_check_fail 1/1 failed! 2025-12-04T09:17:08.7446421Z Running inductor/test_max_autotune 1/1 ... [2025-12-04 09:17:08.744443][1472.408260363] 2025-12-04T09:17:08.7447012Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:17:08.7451102Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_max_autotune.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:17:08.744843] 2025-12-04T09:17:14.0917266Z 2025-12-04T09:17:14.0919050Z inductor/test_max_autotune 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_max_autotune_1.1_6b81222e05fc6886_.log 2025-12-04T09:17:14.0920395Z 2025-12-04T09:17:14.0920844Z Finished inductor/test_max_autotune 1/1 ... [2025-12-04 09:17:14.091457][1477.755272776], took 0.09min 2025-12-04T09:17:14.0940284Z Running doctests 1/1 ... [2025-12-04 09:17:14.093799][1477.757617928] 2025-12-04T09:17:14.5451778Z msg = Cannot scrape callname=Library.fallback in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/library.py line=368. 2025-12-04T09:17:14.5452862Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T09:17:14.5453475Z Registers the function implementation as the fallback for the given key. 2025-12-04T09:17:14.5453825Z 2025-12-04T09:17:14.5454048Z This function only works for a library with global namespace ("_"). 2025-12-04T09:17:14.5454373Z 2025-12-04T09:17:14.5454453Z Args: 2025-12-04T09:17:14.5454894Z fn: function used as fallback for the given dispatch key or :func:`~fallthrough_kernel` 2025-12-04T09:17:14.5455424Z to register a fallthrough. 2025-12-04T09:17:14.5455975Z dispatch_key: dispatch key that the input function should be registered for. By default, it uses 2025-12-04T09:17:14.5456577Z the dispatch key that the library was created with. 2025-12-04T09:17:14.5457144Z with_keyset: flag controlling if the current dispatcher call keyset should be passed as the first argument 2025-12-04T09:17:14.5457798Z to :attr:`fn` when calling. This should be used to create the appropriate keyset for redispatch calls. 2025-12-04T09:17:14.5458147Z 2025-12-04T09:17:14.5458240Z Example:: 2025-12-04T09:17:14.5458356Z 2025-12-04T09:17:14.5458444Z >>> my_lib = Library("_", "IMPL") 2025-12-04T09:17:14.5458720Z >>> def fallback_kernel(op, *args, **kwargs): 2025-12-04T09:17:14.5459021Z >>> # Handle all autocast ops generically 2025-12-04T09:17:14.5459276Z >>> # ... 2025-12-04T09:17:14.5459514Z >>> my_lib.fallback(fallback_kernel, "Autocast") 2025-12-04T09:17:14.5459781Z 2025-12-04T09:17:14.5460364Z Original Error: IndentationError('expected an indented block after function definition on line 2', ('', 5, 1, 'my_lib.fallback(fallback_kernel, "Autocast")\n', 5, 7)) 2025-12-04T09:17:14.5461192Z 2025-12-04T09:17:14.5461296Z my_lib.fallback(fallback_kernel, "Autocast") 2025-12-04T09:17:14.5461558Z ^ 2025-12-04T09:17:14.5570881Z msg = Cannot scrape callname=register_fake in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/library.py line=958. 2025-12-04T09:17:14.5571782Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T09:17:14.5572299Z Register a FakeTensor implementation ("fake impl") for this operator. 2025-12-04T09:17:14.5572810Z 2025-12-04T09:17:14.5573111Z Also sometimes known as a "meta kernel", "abstract impl". 2025-12-04T09:17:14.5573437Z 2025-12-04T09:17:14.5574160Z An "FakeTensor implementation" specifies the behavior of this operator on 2025-12-04T09:17:14.5575031Z Tensors that carry no data ("FakeTensor"). Given some input Tensors with 2025-12-04T09:17:14.5575712Z certain properties (sizes/strides/storage_offset/device), it specifies 2025-12-04T09:17:14.5576149Z what the properties of the output Tensors are. 2025-12-04T09:17:14.5576412Z 2025-12-04T09:17:14.5576601Z The FakeTensor implementation has the same signature as the operator. 2025-12-04T09:17:14.5583386Z It is run for both FakeTensors and meta tensors. To write a FakeTensor 2025-12-04T09:17:14.5583817Z implementation, assume that all Tensor inputs to the operator are 2025-12-04T09:17:14.5584198Z regular CPU/CUDA/Meta tensors, but they do not have storage, and 2025-12-04T09:17:14.5584574Z you are trying to return regular CPU/CUDA/Meta tensor(s) as output. 2025-12-04T09:17:14.5584973Z The FakeTensor implementation must consist of only PyTorch operations 2025-12-04T09:17:14.5585504Z (and may not directly access the storage or data of any input or 2025-12-04T09:17:14.5585809Z intermediate Tensors). 2025-12-04T09:17:14.5585944Z 2025-12-04T09:17:14.5586053Z This API may be used as a decorator (see examples). 2025-12-04T09:17:14.5586239Z 2025-12-04T09:17:14.5586353Z For a detailed guide on custom ops, please see 2025-12-04T09:17:14.5586722Z https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html 2025-12-04T09:17:14.5586979Z 2025-12-04T09:17:14.5587039Z Args: 2025-12-04T09:17:14.5587289Z op_name: Operator name (along with the overload) or OpOverload object. 2025-12-04T09:17:14.5587608Z func: Fake tensor implementation. 2025-12-04T09:17:14.5587900Z lib (Optional[Library]): Library to register the fake tensor to. 2025-12-04T09:17:14.5588238Z allow_override: Flag controlling if we want to override an 2025-12-04T09:17:14.5588566Z existing registered fake impl. This is by default off, 2025-12-04T09:17:14.5588889Z and will error you're trying to register a fake impl to 2025-12-04T09:17:14.5589202Z an operator that already has a fake impl. This also only 2025-12-04T09:17:14.5589503Z applies if the custom operator was not created via 2025-12-04T09:17:14.5589814Z torch.library.custom_op, as overriding and existing fake 2025-12-04T09:17:14.5590110Z impl is already allowed. 2025-12-04T09:17:14.5590264Z 2025-12-04T09:17:14.5590325Z Examples: 2025-12-04T09:17:14.5590480Z >>> import torch 2025-12-04T09:17:14.5590659Z >>> import numpy as np 2025-12-04T09:17:14.5590857Z >>> from torch import Tensor 2025-12-04T09:17:14.5591054Z >>> 2025-12-04T09:17:14.5591264Z >>> # Example 1: an operator without data-dependent output shape 2025-12-04T09:17:14.5591625Z >>> @torch.library.custom_op("mylib::custom_linear", mutates_args=()) 2025-12-04T09:17:14.5592009Z >>> def custom_linear(x: Tensor, weight: Tensor, bias: Tensor) -> Tensor: 2025-12-04T09:17:14.5592361Z >>> raise NotImplementedError("Implementation goes here") 2025-12-04T09:17:14.5592612Z >>> 2025-12-04T09:17:14.5592815Z >>> @torch.library.register_fake("mylib::custom_linear") 2025-12-04T09:17:14.5593078Z >>> def _(x, weight, bias): 2025-12-04T09:17:14.5593380Z >>> assert x.dim() == 2 2025-12-04T09:17:14.5593586Z >>> assert weight.dim() == 2 2025-12-04T09:17:14.5593805Z >>> assert bias.dim() == 1 2025-12-04T09:17:14.5594024Z >>> assert x.shape[1] == weight.shape[1] 2025-12-04T09:17:14.5594279Z >>> assert weight.shape[0] == bias.shape[0] 2025-12-04T09:17:14.5594533Z >>> assert x.device == weight.device 2025-12-04T09:17:14.5594740Z >>> 2025-12-04T09:17:14.5594906Z >>> return (x @ weight.t()) + bias 2025-12-04T09:17:14.5595163Z >>> 2025-12-04T09:17:14.5595415Z >>> with torch._subclasses.fake_tensor.FakeTensorMode(): 2025-12-04T09:17:14.5595675Z >>> x = torch.randn(2, 3) 2025-12-04T09:17:14.5595897Z >>> w = torch.randn(3, 3) 2025-12-04T09:17:14.5596099Z >>> b = torch.randn(3) 2025-12-04T09:17:14.5596324Z >>> y = torch.ops.mylib.custom_linear(x, w, b) 2025-12-04T09:17:14.5596563Z >>> 2025-12-04T09:17:14.5596720Z >>> assert y.shape == (2, 3) 2025-12-04T09:17:14.5596904Z >>> 2025-12-04T09:17:14.5597103Z >>> # Example 2: an operator with data-dependent output shape 2025-12-04T09:17:14.5597451Z >>> @torch.library.custom_op("mylib::custom_nonzero", mutates_args=()) 2025-12-04T09:17:14.5597765Z >>> def custom_nonzero(x: Tensor) -> Tensor: 2025-12-04T09:17:14.5598008Z >>> x_np = x.numpy(force=True) 2025-12-04T09:17:14.5598249Z >>> res = np.stack(np.nonzero(x_np), axis=1) 2025-12-04T09:17:14.5598512Z >>> return torch.tensor(res, device=x.device) 2025-12-04T09:17:14.5598781Z >>> 2025-12-04T09:17:14.5598990Z >>> @torch.library.register_fake("mylib::custom_nonzero") 2025-12-04T09:17:14.5599246Z >>> def _(x): 2025-12-04T09:17:14.5599450Z >>> # Number of nonzero-elements is data-dependent. 2025-12-04T09:17:14.5599733Z >>> # Since we cannot peek at the data in an fake impl, 2025-12-04T09:17:14.5600025Z >>> # we use the ctx object to construct a new symint that 2025-12-04T09:17:14.5600294Z >>> # represents the data-dependent size. 2025-12-04T09:17:14.5600535Z >>> ctx = torch.library.get_ctx() 2025-12-04T09:17:14.5600776Z >>> nnz = ctx.new_dynamic_size() 2025-12-04T09:17:14.5600997Z >>> shape = [nnz, x.dim()] 2025-12-04T09:17:14.5601234Z >>> result = x.new_empty(shape, dtype=torch.int64) 2025-12-04T09:17:14.5601478Z >>> return result 2025-12-04T09:17:14.5601652Z >>> 2025-12-04T09:17:14.5601860Z >>> from torch.fx.experimental.proxy_tensor import make_fx 2025-12-04T09:17:14.5602117Z >>> 2025-12-04T09:17:14.5602274Z >>> x = torch.tensor([0, 1, 2, 3, 4, 0]) 2025-12-04T09:17:14.5602591Z >>> trace = make_fx(torch.ops.mylib.custom_nonzero, tracing_mode="symbolic")(x) 2025-12-04T09:17:14.5602916Z >>> trace.print_readable() 2025-12-04T09:17:14.5603109Z >>> 2025-12-04T09:17:14.5603350Z >>> assert torch.allclose(trace(x), torch.ops.mylib.custom_nonzero(x)) 2025-12-04T09:17:14.5603581Z 2025-12-04T09:17:14.5603635Z 2025-12-04T09:17:14.5604072Z Original Error: IndentationError('expected an indented block after function definition on line 37', ('', 38, 1, '_._ = None\n', 38, 2)) 2025-12-04T09:17:14.5604497Z 2025-12-04T09:17:14.5604561Z _._ = None 2025-12-04T09:17:14.5604692Z ^ 2025-12-04T09:17:14.5689326Z msg = Cannot scrape callname=get_kernel in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/library.py line=1530. 2025-12-04T09:17:14.5690511Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T09:17:14.5691026Z Returns the computed kernel for a given operator and dispatch key. 2025-12-04T09:17:14.5691301Z 2025-12-04T09:17:14.5691501Z This function retrieves the kernel that would be executed for a given 2025-12-04T09:17:14.5691994Z operator and dispatch key combination. The returned SafeKernelFunction 2025-12-04T09:17:14.5692702Z can be used to call the kernel in a boxed fashion. The intended use 2025-12-04T09:17:14.5693142Z case for this function is to retrieve the original kernel for a given 2025-12-04T09:17:14.5693590Z dispatch key and then register another kernel to the same dispatch key 2025-12-04T09:17:14.5694007Z that calls into the original kernel for certain cases. 2025-12-04T09:17:14.5694239Z 2025-12-04T09:17:14.5694318Z Args: 2025-12-04T09:17:14.5694599Z op: Operator name (along with the overload) or OpOverload object 2025-12-04T09:17:14.5695155Z Can be a string (e.g., "aten::add.Tensor"), an OpOverload, or a CustomOpDef. 2025-12-04T09:17:14.5695734Z dispatch_key (str | torch.DispatchKey): The dispatch key to get the kernel for. 2025-12-04T09:17:14.5696210Z Can be a string (e.g., "CPU", "CUDA") or a DispatchKey enum value. 2025-12-04T09:17:14.5696468Z 2025-12-04T09:17:14.5696540Z Returns: 2025-12-04T09:17:14.5696851Z torch._C._SafeKernelFunction: A safe kernel function that can be used to 2025-12-04T09:17:14.5697160Z call the kernel. 2025-12-04T09:17:14.5697281Z 2025-12-04T09:17:14.5697345Z Raises: 2025-12-04T09:17:14.5697527Z RuntimeError: If the operator does not exist. 2025-12-04T09:17:14.5697710Z 2025-12-04T09:17:14.5697769Z Example: 2025-12-04T09:17:14.5697943Z >>> # Get the CPU kernel for torch.add 2025-12-04T09:17:14.5698228Z >>> kernel = torch.library.get_kernel("aten::add.Tensor", "CPU") 2025-12-04T09:17:14.5698510Z >>> 2025-12-04T09:17:14.5698687Z >>> # You can also use DispatchKey enum 2025-12-04T09:17:14.5699079Z >>> kernel = torch.library.get_kernel("aten::add.Tensor", torch.DispatchKey.CPU) 2025-12-04T09:17:14.5699395Z >>> 2025-12-04T09:17:14.5699565Z >>> # Or use an OpOverload directly 2025-12-04T09:17:14.5699871Z >>> kernel = torch.library.get_kernel(torch.ops.aten.add.Tensor, "CPU") 2025-12-04T09:17:14.5700154Z >>> 2025-12-04T09:17:14.5700377Z >>> # Example: Using get_kernel in a custom op with conditional dispatch 2025-12-04T09:17:14.5700684Z >>> # Get the original kernel for torch.sin 2025-12-04T09:17:14.5700985Z >>> original_sin_kernel = torch.library.get_kernel("aten::sin", "CPU") 2025-12-04T09:17:14.5701259Z >>> 2025-12-04T09:17:14.5701492Z >>> # If input has negative values, use original sin, otherwise return zeros 2025-12-04T09:17:14.5701823Z >>> def conditional_sin_impl(dispatch_keys, x): 2025-12-04T09:17:14.5702055Z >>> if (x < 0).any(): 2025-12-04T09:17:14.5702310Z >>> return original_sin_kernel.call_boxed(dispatch_keys, x) 2025-12-04T09:17:14.5702579Z >>> else: 2025-12-04T09:17:14.5702754Z >>> return torch.zeros_like(x) 2025-12-04T09:17:14.5702959Z >>> 2025-12-04T09:17:14.5703134Z >>> lib = torch.library.Library("aten", "IMPL") 2025-12-04T09:17:14.5703479Z >>> # with_keyset=True so the first argument to the impl is the current DispatchKeySet 2025-12-04T09:17:14.5703866Z >>> which needs to be the first argument to ``kernel.call_boxed`` 2025-12-04T09:17:14.5704201Z >>> lib.impl("sin", conditional_sin_impl, "CPU", with_keyset=True) 2025-12-04T09:17:14.5704458Z >>> 2025-12-04T09:17:14.5704607Z >>> # Test the conditional behavior 2025-12-04T09:17:14.5704842Z >>> x_positive = torch.tensor([1.0, 2.0]) 2025-12-04T09:17:14.5705083Z >>> x_mixed = torch.tensor([-1.0, 2.0]) 2025-12-04T09:17:14.5705300Z >>> torch.sin(x_positive) 2025-12-04T09:17:14.5705504Z tensor([0., 0.]) 2025-12-04T09:17:14.5705684Z >>> torch.sin(x_mixed) 2025-12-04T09:17:14.5705869Z tensor([-0.8415, 0.9093]) 2025-12-04T09:17:14.5706050Z 2025-12-04T09:17:14.5706465Z Original Error: SyntaxError('invalid syntax', ('', 23, 7, 'which needs to be the first argument to ``kernel.call_boxed``\n', 23, 12)) 2025-12-04T09:17:14.5706908Z 2025-12-04T09:17:14.5707038Z which needs to be the first argument to ``kernel.call_boxed`` 2025-12-04T09:17:14.5707291Z ^ 2025-12-04T09:17:14.9483145Z msg = Cannot scrape callname=is_available in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/accelerator/__init__.py line=70. 2025-12-04T09:17:14.9484135Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T09:17:14.9484727Z Check if the current accelerator is available at runtime: it was build, all the 2025-12-04T09:17:14.9485575Z required drivers are available and at least one device is visible. 2025-12-04T09:17:14.9486165Z See :ref:`accelerator` for details. 2025-12-04T09:17:14.9486399Z 2025-12-04T09:17:14.9486488Z Returns: 2025-12-04T09:17:14.9486854Z bool: A boolean indicating if there is an available :ref:`accelerator`. 2025-12-04T09:17:14.9487151Z 2025-12-04T09:17:14.9487354Z .. note:: This API delegates to the device-specific version of `is_available`. 2025-12-04T09:17:14.9487819Z On CUDA, when the environment variable ``PYTORCH_NVML_BASED_CUDA_CHECK=1`` is set, 2025-12-04T09:17:14.9488270Z this function will NOT poison fork. Otherwise, it will. For more details, see 2025-12-04T09:17:14.9488660Z :ref:`multiprocessing-poison-fork-note`. 2025-12-04T09:17:14.9488853Z 2025-12-04T09:17:14.9488921Z Example:: 2025-12-04T09:17:14.9489019Z 2025-12-04T09:17:14.9489220Z >>> assert torch.accelerator.is_available() "No available accelerators detected." 2025-12-04T09:17:14.9489555Z 2025-12-04T09:17:14.9490138Z Original Error: SyntaxError('invalid syntax', ('', 1, 41, 'assert torch.accelerator.is_available() "No available accelerators detected."\n', 1, 78)) 2025-12-04T09:17:14.9490639Z 2025-12-04T09:17:14.9490830Z assert torch.accelerator.is_available() "No available accelerators detected." 2025-12-04T09:17:14.9491177Z ^ 2025-12-04T09:17:14.9510191Z msg = Cannot scrape callname=synchronize in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/accelerator/__init__.py line=239. 2025-12-04T09:17:14.9511239Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T09:17:14.9511738Z Wait for all kernels in all streams on the given device to complete. 2025-12-04T09:17:14.9512006Z 2025-12-04T09:17:14.9512088Z Args: 2025-12-04T09:17:14.9512466Z device (:class:`torch.device`, str, int, optional): device for which to synchronize. It must match 2025-12-04T09:17:14.9513045Z the current :ref:`accelerator` device type. If not given, 2025-12-04T09:17:14.9513513Z use :func:`torch.accelerator.current_device_index` by default. 2025-12-04T09:17:14.9513775Z 2025-12-04T09:17:14.9514046Z .. note:: This function is a no-op if the current :ref:`accelerator` is not initialized. 2025-12-04T09:17:14.9514396Z 2025-12-04T09:17:14.9514474Z Example:: 2025-12-04T09:17:14.9514604Z 2025-12-04T09:17:14.9514726Z >>> # xdoctest: +REQUIRES(env:TORCH_DOCTEST_CUDA) 2025-12-04T09:17:14.9515164Z >>> assert torch.accelerator.is_available() "No available accelerators detected." 2025-12-04T09:17:14.9515611Z >>> start_event = torch.Event(enable_timing=True) 2025-12-04T09:17:14.9515938Z >>> end_event = torch.Event(enable_timing=True) 2025-12-04T09:17:14.9516244Z >>> start_event.record() 2025-12-04T09:17:14.9516615Z >>> tensor = torch.randn(100, device=torch.accelerator.current_accelerator()) 2025-12-04T09:17:14.9517016Z >>> sum = torch.sum(tensor) 2025-12-04T09:17:14.9517234Z >>> end_event.record() 2025-12-04T09:17:14.9517448Z >>> torch.accelerator.synchronize() 2025-12-04T09:17:14.9517721Z >>> elapsed_time_ms = start_event.elapsed_time(end_event) 2025-12-04T09:17:14.9517977Z 2025-12-04T09:17:14.9518435Z Original Error: SyntaxError('invalid syntax', ('', 2, 41, 'assert torch.accelerator.is_available() "No available accelerators detected."\n', 2, 78)) 2025-12-04T09:17:14.9519012Z 2025-12-04T09:17:14.9519197Z assert torch.accelerator.is_available() "No available accelerators detected." 2025-12-04T09:17:14.9519517Z ^ 2025-12-04T09:17:14.9755452Z msg = Cannot scrape callname=cudart in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/__init__.py line=448. 2025-12-04T09:17:14.9756476Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T09:17:14.9757029Z Retrieves the CUDA runtime API module. 2025-12-04T09:17:14.9757656Z 2025-12-04T09:17:14.9757664Z 2025-12-04T09:17:14.9758058Z This function initializes the CUDA runtime environment if it is not already 2025-12-04T09:17:14.9758864Z initialized and returns the CUDA runtime API module (_cudart). The CUDA 2025-12-04T09:17:14.9759799Z runtime API module provides access to various CUDA runtime functions. 2025-12-04T09:17:14.9760325Z 2025-12-04T09:17:14.9760420Z Args: 2025-12-04T09:17:14.9760629Z ``None`` 2025-12-04T09:17:14.9760777Z 2025-12-04T09:17:14.9760882Z Returns: 2025-12-04T09:17:14.9761163Z module: The CUDA runtime API module (_cudart). 2025-12-04T09:17:14.9761431Z 2025-12-04T09:17:14.9761527Z Raises: 2025-12-04T09:17:14.9762223Z RuntimeError: If CUDA cannot be re-initialized in a forked subprocess. 2025-12-04T09:17:14.9762988Z AssertionError: If PyTorch is not compiled with CUDA support or if libcudart functions are unavailable. 2025-12-04T09:17:14.9763491Z 2025-12-04T09:17:14.9763641Z Example of CUDA operations with profiling: 2025-12-04T09:17:14.9764096Z >>> import torch 2025-12-04T09:17:14.9764413Z >>> from torch.cuda import cudart, check_error 2025-12-04T09:17:14.9764765Z >>> import os 2025-12-04T09:17:14.9764998Z >>> 2025-12-04T09:17:14.9765240Z >>> os.environ["CUDA_PROFILE"] = "1" 2025-12-04T09:17:14.9765554Z >>> 2025-12-04T09:17:14.9765826Z >>> def perform_cuda_operations_with_streams(): 2025-12-04T09:17:14.9766207Z >>> stream = torch.cuda.Stream() 2025-12-04T09:17:14.9766558Z >>> with torch.cuda.stream(stream): 2025-12-04T09:17:14.9766930Z >>> x = torch.randn(100, 100, device='cuda') 2025-12-04T09:17:14.9767235Z >>> y = torch.randn(100, 100, device='cuda') 2025-12-04T09:17:14.9767514Z >>> z = torch.mul(x, y) 2025-12-04T09:17:14.9767761Z >>> return z 2025-12-04T09:17:14.9767939Z >>> 2025-12-04T09:17:14.9768120Z >>> torch.cuda.synchronize() 2025-12-04T09:17:14.9768386Z >>> print("====== Start nsys profiling ======") 2025-12-04T09:17:14.9768673Z >>> check_error(cudart().cudaProfilerStart()) 2025-12-04T09:17:14.9768971Z >>> with torch.autograd.profiler.emit_nvtx(): 2025-12-04T09:17:14.9769286Z >>> result = perform_cuda_operations_with_streams() 2025-12-04T09:17:14.9769589Z >>> print("CUDA operations completed.") 2025-12-04T09:17:14.9769895Z >>> check_error(torch.cuda.cudart().cudaProfilerStop()) 2025-12-04T09:17:14.9770203Z >>> print("====== End nsys profiling ======") 2025-12-04T09:17:14.9770383Z 2025-12-04T09:17:14.9770541Z To run this example and save the profiling information, execute: 2025-12-04T09:17:14.9771067Z >>> $ nvprof --profile-from-start off --csv --print-summary -o trace_name.prof -f -- python cudart_test.py 2025-12-04T09:17:14.9771461Z 2025-12-04T09:17:14.9771660Z This command profiles the CUDA operations in the provided script and saves 2025-12-04T09:17:14.9772104Z the profiling information to a file named `trace_name.prof`. 2025-12-04T09:17:14.9772680Z The `--profile-from-start off` option ensures that profiling starts only 2025-12-04T09:17:14.9773062Z after the `cudaProfilerStart` call in the script. 2025-12-04T09:17:14.9773441Z The `--csv` and `--print-summary` options format the profiling output as a 2025-12-04T09:17:14.9773901Z CSV file and print a summary, respectively. 2025-12-04T09:17:14.9774284Z The `-o` option specifies the output file name, and the `-f` option forces the 2025-12-04T09:17:14.9774683Z overwrite of the output file if it already exists. 2025-12-04T09:17:14.9774959Z 2025-12-04T09:17:14.9775565Z Original Error: SyntaxError('invalid syntax', ('', 1, 1, '$ nvprof --profile-from-start off --csv --print-summary -o trace_name.prof -f -- python cudart_test.py\n', 1, 2)) 2025-12-04T09:17:14.9776166Z 2025-12-04T09:17:14.9776502Z $ nvprof --profile-from-start off --csv --print-summary -o trace_name.prof -f -- python cudart_test.py 2025-12-04T09:17:14.9776997Z ^ 2025-12-04T09:17:16.0801039Z msg = Cannot scrape callname=vmap in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/apis.py line=39. 2025-12-04T09:17:16.0802137Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T09:17:16.0802531Z 2025-12-04T09:17:16.0802795Z vmap is the vectorizing map; ``vmap(func)`` returns a new function that 2025-12-04T09:17:16.0803348Z maps ``func`` over some dimension of the inputs. Semantically, vmap 2025-12-04T09:17:16.0803899Z pushes the map into PyTorch operations called by ``func``, effectively 2025-12-04T09:17:16.0804356Z vectorizing those operations. 2025-12-04T09:17:16.0804554Z 2025-12-04T09:17:16.0804782Z vmap is useful for handling batch dimensions: one can write a function 2025-12-04T09:17:16.0805314Z ``func`` that runs on examples and then lift it to a function that can 2025-12-04T09:17:16.0805857Z take batches of examples with ``vmap(func)``. vmap can also be used to 2025-12-04T09:17:16.0806676Z compute batched gradients when composed with autograd. 2025-12-04T09:17:16.0806976Z 2025-12-04T09:17:16.0807081Z .. note:: 2025-12-04T09:17:16.0807364Z :func:`torch.vmap` is aliased to :func:`torch.func.vmap` for 2025-12-04T09:17:16.0807721Z convenience. Use whichever one you'd like. 2025-12-04T09:17:16.0807926Z 2025-12-04T09:17:16.0808008Z Args: 2025-12-04T09:17:16.0808272Z func (function): A Python function that takes one or more arguments. 2025-12-04T09:17:16.0808636Z Must return one or more Tensors. 2025-12-04T09:17:16.0808979Z in_dims (int or nested structure): Specifies which dimension of the 2025-12-04T09:17:16.0809378Z inputs should be mapped over. ``in_dims`` should have a 2025-12-04T09:17:16.0809780Z structure like the inputs. If the ``in_dim`` for a particular 2025-12-04T09:17:16.0810187Z input is None, then that indicates there is no map dimension. 2025-12-04T09:17:16.0810502Z Default: 0. 2025-12-04T09:17:16.0810787Z out_dims (int or Tuple[int]): Specifies where the mapped dimension 2025-12-04T09:17:16.0811211Z should appear in the outputs. If ``out_dims`` is a Tuple, then 2025-12-04T09:17:16.0811587Z it should have one element per output. Default: 0. 2025-12-04T09:17:16.0811954Z randomness (str): Specifies whether the randomness in this 2025-12-04T09:17:16.0812486Z vmap should be the same or different across batches. If 'different', 2025-12-04T09:17:16.0812929Z the randomness for each batch will be different. If 'same', the 2025-12-04T09:17:16.0813385Z randomness will be the same across batches. If 'error', any calls to 2025-12-04T09:17:16.0813838Z random functions will error. Default: 'error'. WARNING: this flag 2025-12-04T09:17:16.0814292Z only applies to random PyTorch operations and does not apply to 2025-12-04T09:17:16.0814669Z Python's random module or numpy randomness. 2025-12-04T09:17:16.0815062Z chunk_size (None or int): If None (default), apply a single vmap over inputs. 2025-12-04T09:17:16.0815528Z If not None, then compute the vmap :attr:`chunk_size` samples at a time. 2025-12-04T09:17:16.0816019Z Note that :attr:`chunk_size=1` is equivalent to computing the vmap with a for-loop. 2025-12-04T09:17:16.0816552Z If you run into memory issues computing the vmap, please try a non-None chunk_size. 2025-12-04T09:17:16.0816998Z 2025-12-04T09:17:16.0817069Z Returns: 2025-12-04T09:17:16.0817315Z Returns a new "batched" function. It takes the same inputs as 2025-12-04T09:17:16.0817641Z ``func``, except each input has an extra dimension at the index 2025-12-04T09:17:16.0817984Z specified by ``in_dims``. It takes returns the same outputs as 2025-12-04T09:17:16.0818331Z ``func``, except each output has an extra dimension at the index 2025-12-04T09:17:16.0818620Z specified by ``out_dims``. 2025-12-04T09:17:16.0818757Z 2025-12-04T09:17:16.0818818Z .. warning: 2025-12-04T09:17:16.0819209Z :func:`vmap` works best with functional-style code. Please do not 2025-12-04T09:17:16.0819568Z perform any side-effects in ``func``, with the exception of 2025-12-04T09:17:16.0819941Z in-place PyTorch operations. Examples of side-effects include mutating 2025-12-04T09:17:16.0820332Z Python data structures and assigning values to variables not captured 2025-12-04T09:17:16.0820621Z in ``func``. 2025-12-04T09:17:16.0820720Z 2025-12-04T09:17:16.0820887Z One example of using :func:`vmap` is to compute batched dot products. PyTorch 2025-12-04T09:17:16.0821263Z doesn't provide a batched ``torch.dot`` API; instead of unsuccessfully 2025-12-04T09:17:16.0821634Z rummaging through docs, use :func:`vmap` to construct a new function. 2025-12-04T09:17:16.0821861Z 2025-12-04T09:17:16.0821959Z >>> torch.dot # [D], [D] -> [] 2025-12-04T09:17:16.0822235Z >>> batched_dot = torch.func.vmap(torch.dot) # [N, D], [N, D] -> [N] 2025-12-04T09:17:16.0822527Z >>> x, y = torch.randn(2, 5), torch.randn(2, 5) 2025-12-04T09:17:16.0822804Z >>> batched_dot(x, y) 2025-12-04T09:17:16.0822933Z 2025-12-04T09:17:16.0823098Z :func:`vmap` can be helpful in hiding batch dimensions, leading to a simpler 2025-12-04T09:17:16.0823397Z model authoring experience. 2025-12-04T09:17:16.0823527Z 2025-12-04T09:17:16.0823602Z >>> batch_size, feature_size = 3, 5 2025-12-04T09:17:16.0823870Z >>> weights = torch.randn(feature_size, requires_grad=True) 2025-12-04T09:17:16.0824115Z >>> 2025-12-04T09:17:16.0824259Z >>> def model(feature_vec): 2025-12-04T09:17:16.0824469Z >>> # Very simple linear model with activation 2025-12-04T09:17:16.0824724Z >>> return feature_vec.dot(weights).relu() 2025-12-04T09:17:16.0824939Z >>> 2025-12-04T09:17:16.0825123Z >>> examples = torch.randn(batch_size, feature_size) 2025-12-04T09:17:16.0825386Z >>> result = torch.vmap(model)(examples) 2025-12-04T09:17:16.0825540Z 2025-12-04T09:17:16.0825710Z :func:`vmap` can also help vectorize computations that were previously difficult 2025-12-04T09:17:16.0826116Z or impossible to batch. One example is higher-order gradient computation. 2025-12-04T09:17:16.0826507Z The PyTorch autograd engine computes vjps (vector-Jacobian products). 2025-12-04T09:17:16.0826931Z Computing a full Jacobian matrix for some function f: R^N -> R^N usually 2025-12-04T09:17:16.0827327Z requires N calls to ``autograd.grad``, one per Jacobian row. Using :func:`vmap`, 2025-12-04T09:17:16.0827729Z we can vectorize the whole computation, computing the Jacobian in a single 2025-12-04T09:17:16.0828036Z call to ``autograd.grad``. 2025-12-04T09:17:16.0828153Z 2025-12-04T09:17:16.0828218Z >>> # Setup 2025-12-04T09:17:16.0828366Z >>> N = 5 2025-12-04T09:17:16.0828521Z >>> f = lambda x: x**2 2025-12-04T09:17:16.0828720Z >>> x = torch.randn(N, requires_grad=True) 2025-12-04T09:17:16.0828991Z >>> y = f(x) 2025-12-04T09:17:16.0829266Z >>> I_N = torch.eye(N) 2025-12-04T09:17:16.0829578Z >>> 2025-12-04T09:17:16.0829832Z >>> # Sequential approach 2025-12-04T09:17:16.0830307Z >>> jacobian_rows = [torch.autograd.grad(y, x, v, retain_graph=True)[0] 2025-12-04T09:17:16.0830809Z >>> for v in I_N.unbind()] 2025-12-04T09:17:16.0831053Z >>> jacobian = torch.stack(jacobian_rows) 2025-12-04T09:17:16.0831267Z >>> 2025-12-04T09:17:16.0831431Z >>> # vectorized gradient computation 2025-12-04T09:17:16.0831720Z >>> def get_vjp(v): 2025-12-04T09:17:16.0831910Z >>> return torch.autograd.grad(y, x, v) 2025-12-04T09:17:16.0832153Z >>> jacobian = torch.vmap(get_vjp)(I_N) 2025-12-04T09:17:16.0832314Z 2025-12-04T09:17:16.0832502Z :func:`vmap` can also be nested, producing an output with multiple batched dimensions 2025-12-04T09:17:16.0832757Z 2025-12-04T09:17:16.0832841Z >>> torch.dot # [D], [D] -> [] 2025-12-04T09:17:16.0833043Z >>> batched_dot = torch.vmap( 2025-12-04T09:17:16.0833249Z ... torch.vmap(torch.dot) 2025-12-04T09:17:16.0833519Z ... ) # [N1, N0, D], [N1, N0, D] -> [N1, N0] 2025-12-04T09:17:16.0833806Z >>> x, y = torch.randn(2, 3, 5), torch.randn(2, 3, 5) 2025-12-04T09:17:16.0834069Z >>> batched_dot(x, y) # tensor of size [2, 3] 2025-12-04T09:17:16.0834231Z 2025-12-04T09:17:16.0834407Z If the inputs are not batched along the first dimension, ``in_dims`` specifies 2025-12-04T09:17:16.0834767Z the dimension that each inputs are batched along as 2025-12-04T09:17:16.0834943Z 2025-12-04T09:17:16.0835013Z >>> torch.dot # [N], [N] -> [] 2025-12-04T09:17:16.0835297Z >>> batched_dot = torch.vmap(torch.dot, in_dims=1) # [N, D], [N, D] -> [D] 2025-12-04T09:17:16.0835609Z >>> x, y = torch.randn(2, 5), torch.randn(2, 5) 2025-12-04T09:17:16.0835820Z >>> batched_dot( 2025-12-04T09:17:16.0835981Z ... x, y 2025-12-04T09:17:16.0836212Z ... ) # output is [5] instead of [2] if batched along the 0th dimension 2025-12-04T09:17:16.0836417Z 2025-12-04T09:17:16.0836593Z If there are multiple inputs each of which is batched along different dimensions, 2025-12-04T09:17:16.0837038Z ``in_dims`` must be a tuple with the batch dimension for each input as 2025-12-04T09:17:16.0837255Z 2025-12-04T09:17:16.0837325Z >>> torch.dot # [D], [D] -> [] 2025-12-04T09:17:16.0837616Z >>> batched_dot = torch.vmap(torch.dot, in_dims=(0, None)) # [N, D], [D] -> [N] 2025-12-04T09:17:16.0837925Z >>> x, y = torch.randn(2, 5), torch.randn(5) 2025-12-04T09:17:16.0838142Z >>> batched_dot( 2025-12-04T09:17:16.0838302Z ... x, y 2025-12-04T09:17:16.0838517Z ... ) # second arg doesn't have a batch dim because in_dim[1] was None 2025-12-04T09:17:16.0838730Z 2025-12-04T09:17:16.0838888Z If the input is a Python struct, ``in_dims`` must be a tuple containing a struct 2025-12-04T09:17:16.0839201Z matching the shape of the input: 2025-12-04T09:17:16.0839337Z 2025-12-04T09:17:16.0839449Z >>> f = lambda dict: torch.dot(dict["x"], dict["y"]) 2025-12-04T09:17:16.0839697Z >>> x, y = torch.randn(2, 5), torch.randn(5) 2025-12-04T09:17:16.0839920Z >>> input = {"x": x, "y": y} 2025-12-04T09:17:16.0840166Z >>> batched_dot = torch.vmap(f, in_dims=({"x": 0, "y": None},)) 2025-12-04T09:17:16.0840424Z >>> batched_dot(input) 2025-12-04T09:17:16.0840545Z 2025-12-04T09:17:16.0840727Z By default, the output is batched along the first dimension. However, it can be batched 2025-12-04T09:17:16.0841070Z along any dimension by using ``out_dims`` 2025-12-04T09:17:16.0841222Z 2025-12-04T09:17:16.0841295Z >>> f = lambda x: x**2 2025-12-04T09:17:16.0841474Z >>> x = torch.randn(2, 5) 2025-12-04T09:17:16.0841686Z >>> batched_pow = torch.vmap(f, out_dims=1) 2025-12-04T09:17:16.0841913Z >>> batched_pow(x) # [5, 2] 2025-12-04T09:17:16.0842036Z 2025-12-04T09:17:16.0842234Z For any function that uses kwargs, the returned function will not batch the kwargs but will 2025-12-04T09:17:16.0842584Z accept kwargs 2025-12-04T09:17:16.0842684Z 2025-12-04T09:17:16.0842754Z >>> x = torch.randn([2, 5]) 2025-12-04T09:17:16.0842948Z >>> def fn(x, scale=4.): 2025-12-04T09:17:16.0843134Z >>> return x * scale 2025-12-04T09:17:16.0843309Z >>> 2025-12-04T09:17:16.0843467Z >>> batched_pow = torch.vmap(fn) 2025-12-04T09:17:16.0843693Z >>> assert torch.allclose(batched_pow(x), x * 4) 2025-12-04T09:17:16.0844013Z >>> batched_pow(x, scale=x) # scale is not batched, output has shape [2, 2, 5] 2025-12-04T09:17:16.0844302Z 2025-12-04T09:17:16.0844378Z .. note:: 2025-12-04T09:17:16.0844609Z vmap does not provide general autobatching or handle variable-length 2025-12-04T09:17:16.0844910Z sequences out of the box. 2025-12-04T09:17:16.0845032Z 2025-12-04T09:17:16.0845396Z Original Error: IndentationError('expected an indented block after function definition on line 4', ('', 5, 1, '_._ = None\n', 5, 2)) 2025-12-04T09:17:16.0845813Z 2025-12-04T09:17:16.0845877Z _._ = None 2025-12-04T09:17:16.0846030Z ^ 2025-12-04T09:17:16.0846507Z msg = Cannot scrape callname=grad in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/apis.py line=306. 2025-12-04T09:17:16.0847112Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T09:17:16.0847520Z ``grad`` operator helps computing gradients of ``func`` with respect to the 2025-12-04T09:17:16.0847892Z input(s) specified by ``argnums``. This operator can be nested to 2025-12-04T09:17:16.0848180Z compute higher-order gradients. 2025-12-04T09:17:16.0848332Z 2025-12-04T09:17:16.0848389Z Args: 2025-12-04T09:17:16.0848615Z func (Callable): A Python function that takes one or more arguments. 2025-12-04T09:17:16.0849005Z Must return a single-element Tensor. If specified ``has_aux`` equals ``True``, 2025-12-04T09:17:16.0849444Z function can return a tuple of single-element Tensor and other auxiliary objects: 2025-12-04T09:17:16.0849772Z ``(output, aux)``. 2025-12-04T09:17:16.0850086Z argnums (int or Tuple[int]): Specifies arguments to compute gradients with respect to. 2025-12-04T09:17:16.0850545Z ``argnums`` can be single integer or tuple of integers. Default: 0. 2025-12-04T09:17:16.0850907Z has_aux (bool): Flag indicating that ``func`` returns a tensor and other 2025-12-04T09:17:16.0851236Z auxiliary objects: ``(output, aux)``. Default: False. 2025-12-04T09:17:16.0851418Z 2025-12-04T09:17:16.0851474Z Returns: 2025-12-04T09:17:16.0851753Z Function to compute gradients with respect to its inputs. By default, the output of 2025-12-04T09:17:16.0852175Z the function is the gradient tensor(s) with respect to the first argument. 2025-12-04T09:17:16.0852698Z If specified ``has_aux`` equals ``True``, tuple of gradients and output auxiliary objects 2025-12-04T09:17:16.0853112Z is returned. If ``argnums`` is a tuple of integers, a tuple of output gradients with 2025-12-04T09:17:16.0853453Z respect to each ``argnums`` value is returned. 2025-12-04T09:17:16.0853618Z 2025-12-04T09:17:16.0853697Z Example of using ``grad``: 2025-12-04T09:17:16.0853824Z 2025-12-04T09:17:16.0853898Z >>> # xdoctest: +SKIP 2025-12-04T09:17:16.0854089Z >>> from torch.func import grad 2025-12-04T09:17:16.0854301Z >>> x = torch.randn([]) 2025-12-04T09:17:16.0854522Z >>> cos_x = grad(lambda x: torch.sin(x))(x) 2025-12-04T09:17:16.0854762Z >>> assert torch.allclose(cos_x, x.cos()) 2025-12-04T09:17:16.0854976Z >>> 2025-12-04T09:17:16.0855131Z >>> # Second-order gradients 2025-12-04T09:17:16.0855362Z >>> neg_sin_x = grad(grad(lambda x: torch.sin(x)))(x) 2025-12-04T09:17:16.0855636Z >>> assert torch.allclose(neg_sin_x, -x.sin()) 2025-12-04T09:17:16.0855796Z 2025-12-04T09:17:16.0855972Z When composed with ``vmap``, ``grad`` can be used to compute per-sample-gradients: 2025-12-04T09:17:16.0856216Z 2025-12-04T09:17:16.0856290Z >>> # xdoctest: +SKIP 2025-12-04T09:17:16.0856487Z >>> from torch.func import grad, vmap 2025-12-04T09:17:16.0856719Z >>> batch_size, feature_size = 3, 5 2025-12-04T09:17:16.0856932Z >>> 2025-12-04T09:17:16.0857088Z >>> def model(weights, feature_vec): 2025-12-04T09:17:16.0857326Z >>> # Very simple linear model with activation 2025-12-04T09:17:16.0857569Z >>> assert feature_vec.dim() == 1 2025-12-04T09:17:16.0857792Z >>> return feature_vec.dot(weights).relu() 2025-12-04T09:17:16.0858082Z >>> 2025-12-04T09:17:16.0858266Z >>> def compute_loss(weights, example, target): 2025-12-04T09:17:16.0858500Z >>> y = model(weights, example) 2025-12-04T09:17:16.0858751Z >>> return ((y - target) ** 2).mean() # MSELoss 2025-12-04T09:17:16.0858976Z >>> 2025-12-04T09:17:16.0859178Z >>> weights = torch.randn(feature_size, requires_grad=True) 2025-12-04T09:17:16.0859466Z >>> examples = torch.randn(batch_size, feature_size) 2025-12-04T09:17:16.0859719Z >>> targets = torch.randn(batch_size) 2025-12-04T09:17:16.0859995Z >>> inputs = (weights, examples, targets) 2025-12-04T09:17:16.0860345Z >>> grad_weight_per_example = vmap(grad(compute_loss), in_dims=(None, 0, 0))( 2025-12-04T09:17:16.0860646Z ... *inputs 2025-12-04T09:17:16.0860808Z ... ) 2025-12-04T09:17:16.0860894Z 2025-12-04T09:17:16.0861014Z Example of using ``grad`` with ``has_aux`` and ``argnums``: 2025-12-04T09:17:16.0861214Z 2025-12-04T09:17:16.0861278Z >>> # xdoctest: +SKIP 2025-12-04T09:17:16.0861474Z >>> from torch.func import grad 2025-12-04T09:17:16.0861693Z >>> def my_loss_func(y, y_pred): 2025-12-04T09:17:16.0862162Z >>> loss_per_sample = (0.5 * y_pred - y) ** 2 2025-12-04T09:17:16.0862420Z >>> loss = loss_per_sample.mean() 2025-12-04T09:17:16.0862658Z >>> return loss, (y_pred, loss_per_sample) 2025-12-04T09:17:16.0862866Z >>> 2025-12-04T09:17:16.0863051Z >>> fn = grad(my_loss_func, argnums=(0, 1), has_aux=True) 2025-12-04T09:17:16.0863307Z >>> y_true = torch.rand(4) 2025-12-04T09:17:16.0863604Z >>> y_preds = torch.rand(4, requires_grad=True) 2025-12-04T09:17:16.0863844Z >>> out = fn(y_true, y_preds) 2025-12-04T09:17:16.0864149Z >>> # > output is ((grads w.r.t y_true, grads w.r.t y_preds), (y_pred, loss_per_sample)) 2025-12-04T09:17:16.0864391Z 2025-12-04T09:17:16.0864455Z .. note:: 2025-12-04T09:17:16.0864654Z Using PyTorch ``torch.no_grad`` together with ``grad``. 2025-12-04T09:17:16.0864849Z 2025-12-04T09:17:16.0864949Z Case 1: Using ``torch.no_grad`` inside a function: 2025-12-04T09:17:16.0865122Z 2025-12-04T09:17:16.0865196Z >>> # xdoctest: +SKIP 2025-12-04T09:17:16.0865390Z >>> def f(x): 2025-12-04T09:17:16.0865566Z >>> with torch.no_grad(): 2025-12-04T09:17:16.0865772Z >>> c = x ** 2 2025-12-04T09:17:16.0865966Z >>> return x - c 2025-12-04T09:17:16.0866090Z 2025-12-04T09:17:16.0866223Z In this case, ``grad(f)(x)`` will respect the inner ``torch.no_grad``. 2025-12-04T09:17:16.0866442Z 2025-12-04T09:17:16.0866568Z Case 2: Using ``grad`` inside ``torch.no_grad`` context manager: 2025-12-04T09:17:16.0866784Z 2025-12-04T09:17:16.0866855Z >>> # xdoctest: +SKIP 2025-12-04T09:17:16.0867052Z >>> with torch.no_grad(): 2025-12-04T09:17:16.0867247Z >>> grad(f)(x) 2025-12-04T09:17:16.0867375Z 2025-12-04T09:17:16.0867523Z In this case, ``grad`` will respect the inner ``torch.no_grad``, but not the 2025-12-04T09:17:16.0867900Z outer one. This is because ``grad`` is a "function transform": its result 2025-12-04T09:17:16.0868270Z should not depend on the result of a context manager outside of ``f``. 2025-12-04T09:17:16.0868497Z 2025-12-04T09:17:16.0868550Z 2025-12-04T09:17:16.0868981Z Original Error: IndentationError('expected an indented block after function definition on line 5', ('', 6, 1, '_._ = None\n', 6, 2)) 2025-12-04T09:17:16.0869401Z 2025-12-04T09:17:16.0869466Z _._ = None 2025-12-04T09:17:16.0869612Z ^ 2025-12-04T09:17:18.7476579Z msg = Cannot scrape callname=CustomOpDef.register_fake in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_library/custom_ops.py line=402. 2025-12-04T09:17:18.7477778Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T09:17:18.7478602Z Register a FakeTensor implementation for this custom op. 2025-12-04T09:17:18.7478879Z 2025-12-04T09:17:18.7479108Z This is necessary to get the operator to work efficiently with torch.compile. 2025-12-04T09:17:18.7479450Z 2025-12-04T09:17:18.7479661Z The Fake impl (sometimes also known as a meta kernel or abstract impl) 2025-12-04T09:17:18.7480172Z specifies the behavior of this operator on Tensors that carry no data. 2025-12-04T09:17:18.7480610Z Given some input Tensors with certain properties 2025-12-04T09:17:18.7481180Z (sizes/strides/storage_offset/device), it specifies what the properties of 2025-12-04T09:17:18.7481695Z the output Tensors are. 2025-12-04T09:17:18.7481864Z 2025-12-04T09:17:18.7482069Z Please see :func:`torch.library.register_fake` for more details. 2025-12-04T09:17:18.7482358Z 2025-12-04T09:17:18.7482432Z Args: 2025-12-04T09:17:18.7482703Z fn (Callable): The function to register as the FakeTensor 2025-12-04T09:17:18.7483056Z implementation. 2025-12-04T09:17:18.7483228Z 2025-12-04T09:17:18.7483306Z Examples: 2025-12-04T09:17:18.7483521Z >>> import torch 2025-12-04T09:17:18.7483771Z >>> import numpy as np 2025-12-04T09:17:18.7484038Z >>> from torch import Tensor 2025-12-04T09:17:18.7484304Z >>> 2025-12-04T09:17:18.7484591Z >>> # Example 1: an operator without data-dependent output shape 2025-12-04T09:17:18.7485053Z >>> @torch.library.custom_op("mylib::linear", mutates_args=()) 2025-12-04T09:17:18.7485502Z >>> def linear(x: Tensor, weight: Tensor, bias: Tensor) -> Tensor: 2025-12-04T09:17:18.7485986Z >>> return (x @ weight.t()) + bias 2025-12-04T09:17:18.7486267Z >>> 2025-12-04T09:17:18.7486475Z >>> @linear.register_fake 2025-12-04T09:17:18.7486750Z >>> def _(x, weight, bias): 2025-12-04T09:17:18.7487022Z >>> assert x.dim() == 2 2025-12-04T09:17:18.7487298Z >>> assert weight.dim() == 2 2025-12-04T09:17:18.7487564Z >>> assert bias.dim() == 1 2025-12-04T09:17:18.7487811Z >>> assert x.shape[1] == weight.shape[1] 2025-12-04T09:17:18.7488074Z >>> assert weight.shape[0] == bias.shape[0] 2025-12-04T09:17:18.7488322Z >>> assert x.device == weight.device 2025-12-04T09:17:18.7488582Z >>> return x.new_empty(x.size(0), weight.size(0)) 2025-12-04T09:17:18.7488815Z >>> 2025-12-04T09:17:18.7488972Z >>> x = torch.randn(2, 2) 2025-12-04T09:17:18.7489200Z >>> weight = torch.randn(2, 2) 2025-12-04T09:17:18.7489425Z >>> bias = torch.randn(2) 2025-12-04T09:17:18.7489651Z >>> # xdoctest: +SKIP("Requires Python <= 3.11") 2025-12-04T09:17:18.7489951Z >>> out = torch.compile(linear, fullgraph=True)(x, weight, bias) 2025-12-04T09:17:18.7490244Z >>> # xdoctest: +SKIP("Requires Python <= 3.11") 2025-12-04T09:17:18.7490573Z >>> assert torch.allclose(out, torch.nn.functional.linear(x, weight, bias)) 2025-12-04T09:17:18.7490870Z >>> 2025-12-04T09:17:18.7491079Z >>> # Example 2: an operator with data-dependent output shape 2025-12-04T09:17:18.7491411Z >>> @torch.library.custom_op("mylib::nonzero", mutates_args=()) 2025-12-04T09:17:18.7491692Z >>> def nonzero(x: Tensor) -> Tensor: 2025-12-04T09:17:18.7491917Z >>> x_np = x.cpu().numpy() 2025-12-04T09:17:18.7492156Z >>> res = np.stack(np.nonzero(x_np), axis=1) 2025-12-04T09:17:18.7492508Z >>> return torch.tensor(res, device=x.device) 2025-12-04T09:17:18.7492741Z >>> 2025-12-04T09:17:18.7492903Z >>> @nonzero.register_fake 2025-12-04T09:17:18.7493105Z >>> def _(x): 2025-12-04T09:17:18.7493315Z >>> # Number of nonzero-elements is data-dependent. 2025-12-04T09:17:18.7493610Z >>> # Since we cannot peek at the data in an abstract impl, 2025-12-04T09:17:18.7493956Z >>> # we use the ctx object to construct a new symint that 2025-12-04T09:17:18.7494218Z >>> # represents the data-dependent size. 2025-12-04T09:17:18.7494461Z >>> ctx = torch.library.get_ctx() 2025-12-04T09:17:18.7494691Z >>> nnz = ctx.new_dynamic_size() 2025-12-04T09:17:18.7494911Z >>> shape = [nnz, x.dim()] 2025-12-04T09:17:18.7495149Z >>> result = x.new_empty(shape, dtype=torch.int64) 2025-12-04T09:17:18.7495395Z >>> return result 2025-12-04T09:17:18.7495623Z >>> 2025-12-04T09:17:18.7495826Z >>> x = torch.tensor([0, 1, 2, 0, 0, 1]) 2025-12-04T09:17:18.7496069Z >>> # xdoctest: +SKIP("Requires Python <= 3.11") 2025-12-04T09:17:18.7496334Z >>> out = torch.compile(nonzero, fullgraph=True)(x) 2025-12-04T09:17:18.7496588Z >>> # xdoctest: +SKIP("Requires Python <= 3.11") 2025-12-04T09:17:18.7496843Z >>> assert torch.allclose(out, x.nonzero()) 2025-12-04T09:17:18.7497010Z 2025-12-04T09:17:18.7497064Z 2025-12-04T09:17:18.7497512Z Original Error: IndentationError('expected an indented block after function definition on line 36', ('', 37, 1, '_._ = None\n', 37, 2)) 2025-12-04T09:17:18.7497944Z 2025-12-04T09:17:18.7497999Z _._ = None 2025-12-04T09:17:18.7498132Z ^ 2025-12-04T09:17:18.7644878Z msg = Cannot scrape callname=unsafe_generate_fake_kernels in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_library/fake_profile.py line=94. 2025-12-04T09:17:18.7646115Z Caused by: DoctestParseError('Failed to parse doctest in _label_docsrc_lines') 2025-12-04T09:17:18.7646462Z 2025-12-04T09:17:18.7646703Z Registers a fake kernel based on the given operator profiles. This fake 2025-12-04T09:17:18.7647342Z kernel registration will override any existing fake kernel registrations. 2025-12-04T09:17:18.7647719Z 2025-12-04T09:17:18.7647985Z The input is a dictionary mapping operator names to a set of operator 2025-12-04T09:17:18.7648457Z profiles, which we will use to generate fake kernels. The operator profiles 2025-12-04T09:17:18.7649079Z are a record of the input and output tensor metadata. Based on this 2025-12-04T09:17:18.7649666Z information we will match a given input to the recorded profile, and return 2025-12-04T09:17:18.7650179Z an output with the same metadata as in the recorded profile. If a profile 2025-12-04T09:17:18.7650720Z doesn't exist then an exception will be thrown. 2025-12-04T09:17:18.7650941Z 2025-12-04T09:17:18.7651136Z The fake kernel generation is considered unsafe because it relies on the 2025-12-04T09:17:18.7651774Z rigid, pre-defined operator profiles that do not account for potential 2025-12-04T09:17:18.7652442Z variations in output behavior. Specifically, the generated kernels assume a 2025-12-04T09:17:18.7652997Z fixed relationship between input and output ranks. However, in reality, it's 2025-12-04T09:17:18.7653661Z possible that data-dependent operations may produce outputs of different 2025-12-04T09:17:18.7654218Z ranks even when given inputs of the same rank. The generated fake kernels 2025-12-04T09:17:18.7654764Z are inflexible and unable to accommodate these nuances, making them 2025-12-04T09:17:18.7655232Z potentially unsafe. 2025-12-04T09:17:18.7655395Z 2025-12-04T09:17:18.7655464Z Args: 2025-12-04T09:17:18.7655752Z op_profiles (dict[str, set[OpProfile]]): A dictionary mapping operator 2025-12-04T09:17:18.7656351Z name to a set of operator profiles from which we will generate fake 2025-12-04T09:17:18.7656698Z kernels. 2025-12-04T09:17:18.7656841Z 2025-12-04T09:17:18.7656984Z Examples: 2025-12-04T09:17:18.7657143Z 2025-12-04T09:17:18.7657296Z >>> # Example: Registering an op-profile from draft-export 2025-12-04T09:17:18.7657600Z >>> import torch 2025-12-04T09:17:18.7657900Z >>> from torch.export._draft_export import draft_export 2025-12-04T09:17:18.7658180Z >>> 2025-12-04T09:17:18.7658476Z >>> @torch.library.custom_op("mylib::foo", mutates_args=()) 2025-12-04T09:17:18.7658906Z >>> def foo(x: Tensor, y: Tensor) -> Tensor: 2025-12-04T09:17:18.7659139Z >>> return x + y 2025-12-04T09:17:18.7659307Z >>> 2025-12-04T09:17:18.7659454Z >>> class M(torch.nn.Module): 2025-12-04T09:17:18.7659772Z >>> def forward(self, a, b): 2025-12-04T09:17:18.7660019Z >>> res = torch.ops.mylib.foo(a, b) # no fake impl 2025-12-04T09:17:18.7660251Z >>> return res 2025-12-04T09:17:18.7660457Z >>> 2025-12-04T09:17:18.7660826Z >>> ep = draft_export(M(), (torch.ones(3, 4), torch.ones(3, 4)) 2025-12-04T09:17:18.7661128Z >>> 2025-12-04T09:17:18.7661519Z >>> with torch._library.fake_profile.unsafe_generate_fake_kernels(ep._report.op_profiles): 2025-12-04T09:17:18.7662122Z >>> decomp = ep.run_decompositions() 2025-12-04T09:17:18.7662368Z 2025-12-04T09:17:18.7662373Z 2025-12-04T09:17:18.7662759Z Original Error: IncompleteParseError('ill-formed doctest: all parts have been processed but the doctest source is not balanced') 2025-12-04T09:17:18.7663236Z 2025-12-04T09:17:19.2063769Z msg = Cannot scrape callname=ActivationSparsifier in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/pruning/_experimental/activation_sparsifier/activation_sparsifier.py line=16. 2025-12-04T09:17:19.2065107Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T09:17:19.2065504Z 2025-12-04T09:17:19.2065797Z The Activation sparsifier class aims to sparsify/prune activations in a neural 2025-12-04T09:17:19.2066457Z network. The idea is to attach the sparsifier to a layer (or layers) and it 2025-12-04T09:17:19.2067362Z zeroes out the activations based on the mask_fn (or sparsification function) 2025-12-04T09:17:19.2067847Z input by the user. 2025-12-04T09:17:19.2068234Z The mask_fn is applied once all the inputs are aggregated and reduced i.e. 2025-12-04T09:17:19.2068650Z mask = mask_fn(reduce_fn(aggregate_fn(activations))) 2025-12-04T09:17:19.2068885Z 2025-12-04T09:17:19.2068973Z Note:: 2025-12-04T09:17:19.2069343Z The sparsification mask is computed on the input **before it goes through the attached layer**. 2025-12-04T09:17:19.2069713Z 2025-12-04T09:17:19.2069779Z Args: 2025-12-04T09:17:19.2069955Z model (nn.Module): 2025-12-04T09:17:19.2070276Z The model whose layers will be sparsified. The layers that needs to be 2025-12-04T09:17:19.2070754Z sparsified should be added separately using the register_layer() function 2025-12-04T09:17:19.2071153Z aggregate_fn (Optional, Callable): 2025-12-04T09:17:19.2071559Z default aggregate_fn that is used if not specified while registering the layer. 2025-12-04T09:17:19.2072022Z specifies how inputs should be aggregated over time. 2025-12-04T09:17:19.2072495Z The aggregate_fn should usually take 2 torch tensors and return the aggregated tensor. 2025-12-04T09:17:19.2072908Z Example 2025-12-04T09:17:19.2073171Z def add_agg_fn(tensor1, tensor2): return tensor1 + tensor2 2025-12-04T09:17:19.2073502Z reduce_fn (Optional, Callable): 2025-12-04T09:17:19.2073896Z default reduce_fn that is used if not specified while registering the layer. 2025-12-04T09:17:19.2074416Z reduce_fn will be called on the aggregated tensor i.e. the tensor obtained after 2025-12-04T09:17:19.2074820Z calling agg_fn() on all inputs. 2025-12-04T09:17:19.2075085Z Example 2025-12-04T09:17:19.2075376Z def mean_reduce_fn(agg_tensor): return agg_tensor.mean(dim=0) 2025-12-04T09:17:19.2075731Z mask_fn (Optional, Callable): 2025-12-04T09:17:19.2076165Z default mask_fn that is used to create the sparsification mask using the tensor obtained after 2025-12-04T09:17:19.2076729Z calling the reduce_fn(). This is used by default if a custom one is passed in the 2025-12-04T09:17:19.2077123Z register_layer(). 2025-12-04T09:17:19.2077688Z Note that the mask_fn() definition should contain the sparse arguments that is passed in sparse_config 2025-12-04T09:17:19.2078167Z arguments. 2025-12-04T09:17:19.2078389Z features (Optional, list): 2025-12-04T09:17:19.2078632Z default selected features to sparsify. 2025-12-04T09:17:19.2078979Z If this is non-empty, then the mask_fn will be applied for each feature of the input. 2025-12-04T09:17:19.2079301Z For example, 2025-12-04T09:17:19.2079676Z mask = [mask_fn(reduce_fn(aggregated_fn(input[feature])) for feature in features] 2025-12-04T09:17:19.2080102Z feature_dim (Optional, int): 2025-12-04T09:17:19.2080437Z default dimension of input features. Again, features along this dim will be chosen 2025-12-04T09:17:19.2080791Z for sparsification. 2025-12-04T09:17:19.2081004Z sparse_config (Dict): 2025-12-04T09:17:19.2081289Z Default configuration for the mask_fn. This config will be passed 2025-12-04T09:17:19.2081604Z with the mask_fn() 2025-12-04T09:17:19.2081746Z 2025-12-04T09:17:19.2081806Z Example: 2025-12-04T09:17:19.2081961Z >>> # xdoctest: +SKIP 2025-12-04T09:17:19.2082140Z >>> model = SomeModel() 2025-12-04T09:17:19.2082430Z >>> act_sparsifier = ActivationSparsifier(...) # init activation sparsifier 2025-12-04T09:17:19.2082752Z >>> # Initialize aggregate_fn 2025-12-04T09:17:19.2082958Z >>> def agg_fn(x, y): 2025-12-04T09:17:19.2083135Z >>> return x + y 2025-12-04T09:17:19.2083309Z >>> 2025-12-04T09:17:19.2083455Z >>> # Initialize reduce_fn 2025-12-04T09:17:19.2083693Z >>> def reduce_fn(x): 2025-12-04T09:17:19.2083882Z >>> return torch.mean(x, dim=0) 2025-12-04T09:17:19.2084079Z >>> 2025-12-04T09:17:19.2084225Z >>> # Initialize mask_fn 2025-12-04T09:17:19.2084422Z >>> def mask_fn(data): 2025-12-04T09:17:19.2084627Z >>> return torch.eye(data.shape).to(data.device) 2025-12-04T09:17:19.2084863Z >>> 2025-12-04T09:17:19.2084997Z >>> 2025-12-04T09:17:19.2085151Z >>> act_sparsifier.register_layer( 2025-12-04T09:17:19.2085373Z ... model.some_layer, 2025-12-04T09:17:19.2085570Z ... aggregate_fn=agg_fn, 2025-12-04T09:17:19.2085772Z ... reduce_fn=reduce_fn, 2025-12-04T09:17:19.2085956Z ... mask_fn=mask_fn, 2025-12-04T09:17:19.2086147Z ... ) 2025-12-04T09:17:19.2086290Z >>> 2025-12-04T09:17:19.2086433Z >>> # start training process 2025-12-04T09:17:19.2086622Z >>> for _ in [...]: 2025-12-04T09:17:19.2086798Z >>> # epoch starts 2025-12-04T09:17:19.2087020Z >>> # model.forward(), compute_loss() and model.backwards() 2025-12-04T09:17:19.2087278Z >>> # epoch ends 2025-12-04T09:17:19.2087452Z >>> act_sparsifier.step() 2025-12-04T09:17:19.2087650Z >>> # end training process 2025-12-04T09:17:19.2087845Z >>> sparsifier.squash_mask() 2025-12-04T09:17:19.2087981Z 2025-12-04T09:17:19.2088331Z Original Error: IndentationError("expected an indented block after 'for' statement on line 25", ('', 26, 1, '_._ = None\n', 26, 2)) 2025-12-04T09:17:19.2088752Z 2025-12-04T09:17:19.2088817Z _._ = None 2025-12-04T09:17:19.2088955Z ^ 2025-12-04T09:17:19.7411870Z msg = Cannot scrape callname=DeviceMesh.__getitem__ in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/device_mesh.py line=547. 2025-12-04T09:17:19.7412986Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T09:17:19.7413356Z 2025-12-04T09:17:19.7413660Z Slice the current DeviceMesh based on the mesh_dim_names given to create a submesh. 2025-12-04T09:17:19.7414296Z The submesh created consists of the dimensions and the communicators indicated by 2025-12-04T09:17:19.7414757Z ``mesh_dim_names`` 2025-12-04T09:17:19.7414897Z 2025-12-04T09:17:19.7414982Z Args: 2025-12-04T09:17:19.7415320Z mesh_dim_names (Union[str, Tuple[str]]): the name or the tuple of names of the 2025-12-04T09:17:19.7416111Z mesh dimension of the DeviceMesh to create the submesh for. 2025-12-04T09:17:19.7416479Z Returns: 2025-12-04T09:17:19.7416698Z A :class:`DeviceMesh` object 2025-12-04T09:17:19.7416884Z 2025-12-04T09:17:19.7417160Z The following program runs on each process/rank in an SPMD manner in a world size of 8. 2025-12-04T09:17:19.7417644Z In the first example: 2025-12-04T09:17:19.7418068Z Calling mesh_2d["tp"] on rank 0, 1, 2, 3 returns a 1D submesh of DeviceMesh:([0, 1, 2, 3]). 2025-12-04T09:17:19.7418707Z Calling mesh_2d["tp"] on rank 4, 5, 6, 7 returns a 1D submesh of DeviceMesh:([4, 5, 6, 7]). 2025-12-04T09:17:19.7419267Z Calling mesh_2d["dp"] on rank 0, 4 returns a 1D submesh of DeviceMesh:([0, 4]). 2025-12-04T09:17:19.7419747Z Calling mesh_2d["dp"] on rank 1, 5 returns a 1D submesh of DeviceMesh:([1, 5]). 2025-12-04T09:17:19.7420222Z Calling mesh_2d["dp"] on rank 2, 6 returns a 1D submesh of DeviceMesh:([2, 6]). 2025-12-04T09:17:19.7420685Z Calling mesh_2d["dp"] on rank 3, 7 returns a 1D submesh of DeviceMesh:([3, 7]). 2025-12-04T09:17:19.7420974Z 2025-12-04T09:17:19.7421055Z In the second example: 2025-12-04T09:17:19.7421413Z Calling mesh_3d["dp", "cp"] on rank 0, 1, 4, 5 returns a 2D submesh of DeviceMesh:([[0, 1], [4, 5]]). 2025-12-04T09:17:19.7421940Z Calling mesh_3d["dp", "cp"] on rank 2, 3, 6, 7 returns a 2D submesh of DeviceMesh:([[2, 3], [6, 7]]). 2025-12-04T09:17:19.7422471Z Calling mesh_3d["cp", "dp"] on rank 0, 1, 4, 5 returns a 2D submesh of DeviceMesh:([[0, 4], [1, 5]]). 2025-12-04T09:17:19.7423071Z Calling mesh_3d["cp", "dp"] on rank 2, 3, 6, 7 returns a 2D submesh of DeviceMesh:([[2, 6], [3, 7]]). 2025-12-04T09:17:19.7423387Z 2025-12-04T09:17:19.7423484Z Example:: 2025-12-04T09:17:19.7423587Z 2025-12-04T09:17:19.7423675Z >>> # xdoctest: +SKIP("no rank") 2025-12-04T09:17:19.7423999Z >>> from torch.distributed.device_mesh import DeviceMesh 2025-12-04T09:17:19.7424308Z >>> 2025-12-04T09:17:19.7424577Z >>> # Initialize a 2D device mesh as (2, 4) to represent the topology 2025-12-04T09:17:19.7424953Z >>> # of cross-host(dim 0), and within-host (dim 1). 2025-12-04T09:17:19.7425383Z >>> mesh_2d = init_device_mesh(device_type="cuda", (2,4), mesh_dim_names=("dp", "tp")) 2025-12-04T09:17:19.7425784Z >>> tp_mesh = mesh_2d["tp"] 2025-12-04T09:17:19.7426019Z >>> dp_mesh = mesh_2d["dp"] 2025-12-04T09:17:19.7426236Z >>> 2025-12-04T09:17:19.7426409Z >>> # Initialize a 3D mesh. 2025-12-04T09:17:19.7426798Z >>> mesh_3d = init_device_mesh(device_type="cuda", (2,2,2), mesh_dim_names=("dp", "pp", "cp")) 2025-12-04T09:17:19.7427383Z >>> # The order of the mesh_dim_names provided deteremines the order of dimensions in the submesh. 2025-12-04T09:17:19.7427819Z >>> dp_cp_mesh = mesh_3d["dp", "cp"] 2025-12-04T09:17:19.7428084Z >>> cp_dp_mesh = mesh_3d["cp", "dp"] 2025-12-04T09:17:19.7428256Z 2025-12-04T09:17:19.7428820Z Original Error: SyntaxError('positional argument follows keyword argument', ('', 6, 82, 'mesh_2d = init_device_mesh(device_type="cuda", (2,4), mesh_dim_names=("dp", "tp"))\n', 6, 83)) 2025-12-04T09:17:19.7429448Z 2025-12-04T09:17:19.7429613Z mesh_2d = init_device_mesh(device_type="cuda", (2,4), mesh_dim_names=("dp", "tp")) 2025-12-04T09:17:19.7429935Z ^ 2025-12-04T09:17:20.1366502Z msg = Cannot scrape callname=SavePlanner in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/planner.py line=122. 2025-12-04T09:17:20.1367593Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T09:17:20.1368024Z 2025-12-04T09:17:20.1368317Z Abstract class defining the protocol used by save_state_dict to plan the save process. 2025-12-04T09:17:20.1368665Z 2025-12-04T09:17:20.1368922Z SavePlanners are stateful objects that can be used to customize the whole save process. 2025-12-04T09:17:20.1369275Z 2025-12-04T09:17:20.1369688Z SavePlanner acts as an access proxy to the state_dict, so any transformation done to it 2025-12-04T09:17:20.1370116Z will be visible to the whole process. 2025-12-04T09:17:20.1370312Z 2025-12-04T09:17:20.1370545Z A planner subclass can expect the following sequence of calls during save_state_dict: 2025-12-04T09:17:20.1370877Z 2025-12-04T09:17:20.1370992Z 1) set_up_planner - called on all ranks. 2025-12-04T09:17:20.1371281Z Signals the start of a checkpoint save. 2025-12-04T09:17:20.1371482Z 2025-12-04T09:17:20.1371583Z 2) create_local_plan - called on all ranks. 2025-12-04T09:17:20.1372108Z Process the state_dict and produces a `SavePlan` that will be sent for global planning. 2025-12-04T09:17:20.1372637Z 2025-12-04T09:17:20.1372800Z 3) create_global_plan - called on the coordinator rank only. 2025-12-04T09:17:20.1373209Z Takes the SavePlan from all ranks and make any global decision. 2025-12-04T09:17:20.1373474Z 2025-12-04T09:17:20.1373568Z 4) finish_plan - called on all ranks. 2025-12-04T09:17:20.1373929Z This gives each rank a chance to adjust to global planning decisions. 2025-12-04T09:17:20.1374211Z 2025-12-04T09:17:20.1374336Z 5) resolve_data - called multiple times on each rank 2025-12-04T09:17:20.1374724Z Lookups a value on the `state_dict` for the storage layer to write. 2025-12-04T09:17:20.1374996Z 2025-12-04T09:17:20.1375246Z Users are recommended to extend DefaultSavePlanner instead of this interface directly as 2025-12-04T09:17:20.1375750Z most changes can be expressed by changes in a single method. 2025-12-04T09:17:20.1375994Z 2025-12-04T09:17:20.1376096Z There are 3 usual patterns of extension: 2025-12-04T09:17:20.1376288Z 2025-12-04T09:17:20.1376577Z Rewriting state_dict. This is the simplest way to extend the save process as it 2025-12-04T09:17:20.1377074Z doesn't requite understanding the intrincacies of how SavePlan works: 2025-12-04T09:17:20.1377355Z 2025-12-04T09:17:20.1377453Z >>> # xdoctest: +SKIP("undefined vars") 2025-12-04T09:17:20.1377737Z >>> class RenamePlanner(DefaultSavePlanner): 2025-12-04T09:17:20.1378030Z >>> def set_up_planner( 2025-12-04T09:17:20.1378253Z >>> self, 2025-12-04T09:17:20.1378452Z >>> state_dict: STATE_DICT_TYPE, 2025-12-04T09:17:20.1378707Z >>> storage_meta: Optional[StorageMeta], 2025-12-04T09:17:20.1378951Z >>> is_coordinator: bool, 2025-12-04T09:17:20.1379146Z >>> ) -> None: 2025-12-04T09:17:20.1379323Z >>> # prefix all keys with `foo_`` 2025-12-04T09:17:20.1379665Z >>> super().set_up_planner({"foo_" + k: v for k, v in state_dict.items()}, storage_meta, is_coordinator) 2025-12-04T09:17:20.1379945Z 2025-12-04T09:17:20.1380184Z Modifying local plan and lookup in tandem. This is useful when fine control of how data is persisted 2025-12-04T09:17:20.1380487Z 2025-12-04T09:17:20.1380570Z >>> # xdoctest: +SKIP("undefined vars") 2025-12-04T09:17:20.1380811Z >>> class FP16Planner(DefaultSavePlanner): 2025-12-04T09:17:20.1381044Z >>> def create_local_plan(self): 2025-12-04T09:17:20.1381264Z >>> plan = super().create_local_plan() 2025-12-04T09:17:20.1381490Z >>> for p in plan: 2025-12-04T09:17:20.1381696Z >>> if p.tensor_data is not None: 2025-12-04T09:17:20.1381956Z >>> p.tensor_data.properties.dtype = torch.float16 2025-12-04T09:17:20.1382208Z >>> return plan 2025-12-04T09:17:20.1382373Z >>> 2025-12-04T09:17:20.1382529Z >>> def resolve_data(self, write_item): 2025-12-04T09:17:20.1382758Z >>> item = super().resolve_data(write_item) 2025-12-04T09:17:20.1383125Z >>> return item if write_item.type == WriteItemType.BYTE_IO else item.to(torch.float16) 2025-12-04T09:17:20.1383390Z 2025-12-04T09:17:20.1383625Z Using the global planning step to make central decisions that can't be made individually by each rank 2025-12-04T09:17:20.1390575Z 2025-12-04T09:17:20.1390792Z >>> # xdoctest: +SKIP("undefined vars") 2025-12-04T09:17:20.1391135Z >>> from itertools import zip_longest 2025-12-04T09:17:20.1391373Z >>> from dataclasses import replace 2025-12-04T09:17:20.1391757Z >>> class DDPLoadBalancingPlanner(DefaultSavePlanner): 2025-12-04T09:17:20.1392160Z >>> # This uses the default local plan behavior of having all non-sharded writes in rank 0 2025-12-04T09:17:20.1392516Z >>> # This sample doesn't handle ShardedTensors 2025-12-04T09:17:20.1392776Z >>> def create_global_plan(self, all_plans): 2025-12-04T09:17:20.1393047Z >>> iters = [iter(all_plans[0].items)] * len(all_plans) 2025-12-04T09:17:20.1393304Z >>> items_per_rank = [ 2025-12-04T09:17:20.1393564Z >>> [item for item in items if item is not None] 2025-12-04T09:17:20.1393942Z >>> for items in zip(*zip_longest(*iters), strict=True) 2025-12-04T09:17:20.1394245Z >>> ] 2025-12-04T09:17:20.1394411Z >>> all_plans = [ 2025-12-04T09:17:20.1394603Z >>> replace(plan, items=items) 2025-12-04T09:17:20.1394907Z >>> for plan, items in zip(all_plans, items_per_rank, strict=True) 2025-12-04T09:17:20.1395192Z >>> ] 2025-12-04T09:17:20.1395380Z >>> return super().create_global_plan(all_plans) 2025-12-04T09:17:20.1395557Z 2025-12-04T09:17:20.1395742Z Finally, some planners need to save additional metadata in the checkpoint, this is 2025-12-04T09:17:20.1396190Z accomplished by having each rank contribute their data items in the local plan and 2025-12-04T09:17:20.1396535Z the global planner aggregate them: 2025-12-04T09:17:20.1396675Z 2025-12-04T09:17:20.1396756Z >>> # xdoctest: +SKIP("undefined vars") 2025-12-04T09:17:20.1397022Z >>> class SaveExtraDataPlanner(DefaultSavePlanner): 2025-12-04T09:17:20.1397297Z >>> def create_local_plan(self) -> SavePlan: 2025-12-04T09:17:20.1397590Z >>> plan = super().create_local_plan() 2025-12-04T09:17:20.1397866Z >>> return replace(plan, planner_data="per-rank-data") 2025-12-04T09:17:20.1398116Z >>> 2025-12-04T09:17:20.1398397Z >>> def create_global_plan(self, all_plans: List[SavePlan]) -> Tuple[List[SavePlan], Metadata]: 2025-12-04T09:17:20.1398827Z >>> global_plan, metadata = super().create_global_plan(all_plans) 2025-12-04T09:17:20.1399154Z >>> merged_data = [p.planner_data for p in global_plan] 2025-12-04T09:17:20.1399459Z >>> metadata = replace(metadata, planner_data=merged_data) 2025-12-04T09:17:20.1399723Z >>> return global_plan, metadata 2025-12-04T09:17:20.1399875Z 2025-12-04T09:17:20.1400237Z Original Error: IndentationError('expected an indented block after function definition on line 3', ('', 9, 0, '_._ = None\n', 9, -1)) 2025-12-04T09:17:20.1400671Z 2025-12-04T09:17:20.1400730Z _._ = None 2025-12-04T09:17:20.1400876Z ^ 2025-12-04T09:17:20.1401351Z msg = Cannot scrape callname=LoadPlanner in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/planner.py line=305. 2025-12-04T09:17:20.1401986Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T09:17:20.1402230Z 2025-12-04T09:17:20.1402432Z Abstract class defining the protocol used by load_state_dict to plan the load process. 2025-12-04T09:17:20.1402706Z 2025-12-04T09:17:20.1402905Z LoadPlanner are stateful objects that can be used to customize the whole load process. 2025-12-04T09:17:20.1403172Z 2025-12-04T09:17:20.1403359Z LoadPlanner acts as an access proxy to the state_dict, so any transformation done to it 2025-12-04T09:17:20.1403705Z will be visible to the whole process. 2025-12-04T09:17:20.1403851Z 2025-12-04T09:17:20.1404044Z A planner subclass can expect the following sequence of calls during load_state_dict: 2025-12-04T09:17:20.1404303Z 2025-12-04T09:17:20.1404394Z 1) set_up_planner - called on all ranks. 2025-12-04T09:17:20.1404635Z Signals the start of loading a checkpoint. 2025-12-04T09:17:20.1404811Z 2025-12-04T09:17:20.1404898Z 2) create_local_plan - called on all ranks. 2025-12-04T09:17:20.1405252Z Process the state_dict and produces a `LoadPlan` that will be sent for global planning. 2025-12-04T09:17:20.1405520Z 2025-12-04T09:17:20.1405702Z 3) create_global_plan - called on the coordinator rank only. 2025-12-04T09:17:20.1406040Z Takes the LoadPlan from all ranks and make any global decision. 2025-12-04T09:17:20.1406253Z 2025-12-04T09:17:20.1406356Z 4) load_bytes - called multiple times on each rank 2025-12-04T09:17:20.1406647Z This is called once per non-tensor value in state_dict. 2025-12-04T09:17:20.1406832Z 2025-12-04T09:17:20.1406980Z 5) resolve_tensor and commit_tensor - called multiple times on each rank 2025-12-04T09:17:20.1407352Z They are called in pair for each Tensor value in state_dict. 2025-12-04T09:17:20.1407557Z 2025-12-04T09:17:20.1407813Z Users are recommended to extend DefaultLoadPlanner instead of this interface directly as 2025-12-04T09:17:20.1408270Z most changes can be expressed by changes in a single method. 2025-12-04T09:17:20.1408466Z 2025-12-04T09:17:20.1408557Z There are two usual patterns of extension: 2025-12-04T09:17:20.1408719Z 2025-12-04T09:17:20.1408893Z Rewriting state_dict. This is the simplest way to extend the load process as it 2025-12-04T09:17:20.1409317Z doesn't requite understanding the intrincacies of how LoadPlan works. We need 2025-12-04T09:17:20.1409713Z to keep a reference to the original state_dict as load happens in place so 2025-12-04T09:17:20.1410029Z we need to be able to perform it in place 2025-12-04T09:17:20.1410190Z 2025-12-04T09:17:20.1410267Z >>> # xdoctest: +SKIP("undefined vars") 2025-12-04T09:17:20.1410506Z >>> class RenamePlanner(DefaultLoadPlanner): 2025-12-04T09:17:20.1410732Z >>> def set_up_planner( 2025-12-04T09:17:20.1410924Z >>> self, 2025-12-04T09:17:20.1411101Z >>> state_dict: STATE_DICT_TYPE, 2025-12-04T09:17:20.1411356Z >>> metadata: Metadata, 2025-12-04T09:17:20.1411555Z >>> is_coordinator: bool, 2025-12-04T09:17:20.1411747Z >>> ) -> None: 2025-12-04T09:17:20.1411922Z >>> self.original_state_dict = state_dict 2025-12-04T09:17:20.1412203Z >>> state_dict = {"foo_" + k: v for k, v in state_dict.items()} 2025-12-04T09:17:20.1412576Z >>> 2025-12-04T09:17:20.1412733Z >>> if self.flatten_sharded_tensors: 2025-12-04T09:17:20.1412986Z >>> state_dict = _flatten_sharded_tensors(state_dict) 2025-12-04T09:17:20.1413220Z >>> 2025-12-04T09:17:20.1413367Z >>> if self.flatten_state_dict: 2025-12-04T09:17:20.1413630Z >>> state_dict, self.mappings = flatten_state_dict(state_dict) 2025-12-04T09:17:20.1413889Z >>> 2025-12-04T09:17:20.1414035Z >>> self.state_dict = state_dict 2025-12-04T09:17:20.1414247Z >>> self.metadata = metadata 2025-12-04T09:17:20.1414476Z >>> self.is_coordinator = is_coordinator 2025-12-04T09:17:20.1414691Z >>> 2025-12-04T09:17:20.1414845Z >>> def load_bytes(self, read_item, value): 2025-12-04T09:17:20.1415093Z >>> # Remove the "foo_" prefix 2025-12-04T09:17:20.1415439Z >>> self.original_state_dict[read_item.dest_index.fqn[4:]] = torch.load(value, weights_only=False) 2025-12-04T09:17:20.1415733Z 2025-12-04T09:17:20.1415736Z 2025-12-04T09:17:20.1415915Z Modifying resolve_tensor and commit_tensor to handle load time transformation. 2025-12-04T09:17:20.1416169Z 2025-12-04T09:17:20.1416248Z >>> # xdoctest: +SKIP("undefined vars") 2025-12-04T09:17:20.1416503Z >>> class MetaModelMaterialize(DefaultSavePlanner): 2025-12-04T09:17:20.1416767Z >>> def resolve_tensor(self, read_item): 2025-12-04T09:17:20.1417001Z >>> tensor = super().resolve_tensor(read_item) 2025-12-04T09:17:20.1417268Z >>> return torch.empty_like(tensor, device="cpu") 2025-12-04T09:17:20.1417500Z >>> 2025-12-04T09:17:20.1417658Z >>> def commit_tensor(self, read_item, tensor): 2025-12-04T09:17:20.1417932Z >>> self.state_dict[read_item.dest_index.fqn] = tensor 2025-12-04T09:17:20.1418118Z 2025-12-04T09:17:20.1418482Z Original Error: IndentationError('expected an indented block after function definition on line 22', ('', 23, 0, '_._ = None\n', 23, -1)) 2025-12-04T09:17:20.1418915Z 2025-12-04T09:17:20.1418981Z _._ = None 2025-12-04T09:17:20.1419184Z ^ 2025-12-04T09:17:20.4536547Z msg = Cannot scrape callname=FullStateDictConfig in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/api.py line=295. 2025-12-04T09:17:20.4537526Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T09:17:20.4537883Z 2025-12-04T09:17:20.4538081Z ``FullStateDictConfig`` is a config class meant to be used with 2025-12-04T09:17:20.4538566Z ``StateDictType.FULL_STATE_DICT``. We recommend enabling both 2025-12-04T09:17:20.4538968Z ``offload_to_cpu=True`` and ``rank0_only=True`` when saving full state 2025-12-04T09:17:20.4539620Z dicts to save GPU memory and CPU memory, respectively. This config class 2025-12-04T09:17:20.4540102Z is meant to be used via the :func:`state_dict_type` context manager as 2025-12-04T09:17:20.4540393Z follows: 2025-12-04T09:17:20.4540479Z 2025-12-04T09:17:20.4540579Z >>> # xdoctest: +SKIP("undefined variables") 2025-12-04T09:17:20.4540943Z >>> from torch.distributed.fsdp import FullyShardedDataParallel as FSDP 2025-12-04T09:17:20.4541313Z >>> fsdp = FSDP(model, auto_wrap_policy=...) 2025-12-04T09:17:20.4541638Z >>> cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True) 2025-12-04T09:17:20.4542041Z >>> with FSDP.state_dict_type(fsdp, StateDictType.FULL_STATE_DICT, cfg): 2025-12-04T09:17:20.4542371Z >>> state = fsdp.state_dict() 2025-12-04T09:17:20.4542679Z >>> # `state` will be empty on non rank 0 and contain CPU tensors on rank 0. 2025-12-04T09:17:20.4543089Z >>> # To reload checkpoint for inference, finetuning, transfer learning, etc: 2025-12-04T09:17:20.4543610Z >>> model = model_fn() # Initialize model in preparation for wrapping with FSDP 2025-12-04T09:17:20.4543946Z >>> if dist.get_rank() == 0: 2025-12-04T09:17:20.4544231Z >>> # Load checkpoint only on rank 0 to avoid memory redundancy 2025-12-04T09:17:20.4544548Z >>> state_dict = torch.load("my_checkpoint.pt") 2025-12-04T09:17:20.4544811Z >>> model.load_state_dict(state_dict) 2025-12-04T09:17:20.4545150Z >>> # All ranks initialize FSDP module as usual. `sync_module_states` argument 2025-12-04T09:17:20.4545591Z >>> # communicates loaded checkpoint states from rank 0 to rest of the world. 2025-12-04T09:17:20.4545930Z >>> fsdp = FSDP( 2025-12-04T09:17:20.4546102Z ... model, 2025-12-04T09:17:20.4546298Z ... device_id=torch.cuda.current_device(), 2025-12-04T09:17:20.4546542Z ... auto_wrap_policy=..., 2025-12-04T09:17:20.4546765Z ... sync_module_states=True, 2025-12-04T09:17:20.4546976Z ... ) 2025-12-04T09:17:20.4547217Z >>> # After this point, all ranks have FSDP model with loaded checkpoint. 2025-12-04T09:17:20.4547468Z 2025-12-04T09:17:20.4547533Z Attributes: 2025-12-04T09:17:20.4547780Z rank0_only (bool): If ``True``, then only rank 0 saves the full state 2025-12-04T09:17:20.4548164Z dict, and nonzero ranks save an empty dict. If ``False``, then all 2025-12-04T09:17:20.4548511Z ranks save the full state dict. (Default: ``False``) 2025-12-04T09:17:20.4548715Z 2025-12-04T09:17:20.4549074Z Original Error: IndentationError("expected an indented block after 'if' statement on line 10", ('', 11, 1, '_._ = None\n', 11, 2)) 2025-12-04T09:17:20.4549483Z 2025-12-04T09:17:20.4549547Z _._ = None 2025-12-04T09:17:20.4549686Z ^ 2025-12-04T09:17:22.1626846Z msg = Cannot scrape callname=register_parametrization in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/parametrize.py line=437. 2025-12-04T09:17:22.1627755Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T09:17:22.1628237Z Register a parametrization to a tensor in a module. 2025-12-04T09:17:22.1628467Z 2025-12-04T09:17:22.1628706Z Assume that ``tensor_name="weight"`` for simplicity. When accessing ``module.weight``, 2025-12-04T09:17:22.1629206Z the module will return the parametrized version ``parametrization(module.weight)``. 2025-12-04T09:17:22.1629885Z If the original tensor requires a gradient, the backward pass will differentiate 2025-12-04T09:17:22.1630332Z through :attr:`parametrization`, and the optimizer will update the tensor accordingly. 2025-12-04T09:17:22.1630602Z 2025-12-04T09:17:22.1630811Z The first time that a module registers a parametrization, this function will add an attribute 2025-12-04T09:17:22.1631263Z ``parametrizations`` to the module of type :class:`~ParametrizationList`. 2025-12-04T09:17:22.1631505Z 2025-12-04T09:17:22.1631683Z The list of parametrizations on the tensor ``weight`` will be accessible under 2025-12-04T09:17:22.1632125Z ``module.parametrizations.weight``. 2025-12-04T09:17:22.1632372Z 2025-12-04T09:17:22.1632467Z The original tensor will be accessible under 2025-12-04T09:17:22.1632734Z ``module.parametrizations.weight.original``. 2025-12-04T09:17:22.1632906Z 2025-12-04T09:17:22.1633086Z Parametrizations may be concatenated by registering several parametrizations 2025-12-04T09:17:22.1633406Z on the same attribute. 2025-12-04T09:17:22.1633531Z 2025-12-04T09:17:22.1633690Z The training mode of a registered parametrization is updated on registration 2025-12-04T09:17:22.1634031Z to match the training mode of the host module 2025-12-04T09:17:22.1634195Z 2025-12-04T09:17:22.1634409Z Parametrized parameters and buffers have an inbuilt caching system that can be activated 2025-12-04T09:17:22.1634773Z using the context manager :func:`cached`. 2025-12-04T09:17:22.1634934Z 2025-12-04T09:17:22.1635096Z A :attr:`parametrization` may optionally implement a method with signature 2025-12-04T09:17:22.1635342Z 2025-12-04T09:17:22.1635522Z .. code-block:: python 2025-12-04T09:17:22.1635647Z 2025-12-04T09:17:22.1635807Z def right_inverse(self, X: Tensor) -> Union[Tensor, Sequence[Tensor]] 2025-12-04T09:17:22.1636034Z 2025-12-04T09:17:22.1636214Z This method is called on the unparametrized tensor when the first parametrization 2025-12-04T09:17:22.1636618Z is registered to compute the initial value of the original tensor. 2025-12-04T09:17:22.1637045Z If this method is not implemented, the original tensor will be just the unparametrized tensor. 2025-12-04T09:17:22.1637327Z 2025-12-04T09:17:22.1637549Z If all the parametrizations registered on a tensor implement `right_inverse` it is possible 2025-12-04T09:17:22.1638032Z to initialize a parametrized tensor by assigning to it, as shown in the example below. 2025-12-04T09:17:22.1638304Z 2025-12-04T09:17:22.1638458Z It is possible for the first parametrization to depend on several inputs. 2025-12-04T09:17:22.1638865Z This may be implemented returning a tuple of tensors from ``right_inverse`` 2025-12-04T09:17:22.1639260Z (see the example implementation of a ``RankOne`` parametrization below). 2025-12-04T09:17:22.1639501Z 2025-12-04T09:17:22.1639726Z In this case, the unconstrained tensors are also located under ``module.parametrizations.weight`` 2025-12-04T09:17:22.1640122Z with names ``original0``, ``original1``,... 2025-12-04T09:17:22.1640283Z 2025-12-04T09:17:22.1640351Z .. note:: 2025-12-04T09:17:22.1640444Z 2025-12-04T09:17:22.1640624Z If unsafe=False (default) both the forward and right_inverse methods will be called 2025-12-04T09:17:22.1640988Z once to perform a number of consistency checks. 2025-12-04T09:17:22.1641375Z If unsafe=True, then right_inverse will be called if the tensor is not parametrized, 2025-12-04T09:17:22.1641720Z and nothing will be called otherwise. 2025-12-04T09:17:22.1641874Z 2025-12-04T09:17:22.1641938Z .. note:: 2025-12-04T09:17:22.1642025Z 2025-12-04T09:17:22.1642165Z In most situations, ``right_inverse`` will be a function such that 2025-12-04T09:17:22.1642463Z ``forward(right_inverse(X)) == X`` (see 2025-12-04T09:17:22.1642822Z `right inverse `_). 2025-12-04T09:17:22.1643249Z Sometimes, when the parametrization is not surjective, it may be reasonable 2025-12-04T09:17:22.1643613Z to relax this. 2025-12-04T09:17:22.1643725Z 2025-12-04T09:17:22.1643783Z .. warning:: 2025-12-04T09:17:22.1643874Z 2025-12-04T09:17:22.1644061Z If a parametrization depends on several inputs, :func:`~register_parametrization` 2025-12-04T09:17:22.1644485Z will register a number of new parameters. If such parametrization is registered 2025-12-04T09:17:22.1644928Z after the optimizer is created, these new parameters will need to be added manually 2025-12-04T09:17:22.1645362Z to the optimizer. See :meth:`torch.Optimizer.add_param_group`. 2025-12-04T09:17:22.1645601Z 2025-12-04T09:17:22.1645671Z Args: 2025-12-04T09:17:22.1645893Z module (nn.Module): module on which to register the parametrization 2025-12-04T09:17:22.1646255Z tensor_name (str): name of the parameter or buffer on which to register 2025-12-04T09:17:22.1646547Z the parametrization 2025-12-04T09:17:22.1646817Z parametrization (nn.Module): the parametrization to register 2025-12-04T09:17:22.1647111Z Keyword args: 2025-12-04T09:17:22.1647539Z unsafe (bool): a boolean flag that denotes whether the parametrization 2025-12-04T09:17:22.1648022Z may change the dtype and shape of the tensor. Default: `False` 2025-12-04T09:17:22.1648399Z Warning: the parametrization is not checked for consistency upon registration. 2025-12-04T09:17:22.1648741Z Enable this flag at your own risk. 2025-12-04T09:17:22.1648895Z 2025-12-04T09:17:22.1648951Z Raises: 2025-12-04T09:17:22.1649323Z ValueError: if the module does not have a parameter or a buffer named :attr:`tensor_name` 2025-12-04T09:17:22.1649598Z 2025-12-04T09:17:22.1649664Z Examples: 2025-12-04T09:17:22.1649857Z >>> # xdoctest: +REQUIRES(env:TORCH_DOCTEST_LAPACK) 2025-12-04T09:17:22.1650096Z >>> import torch 2025-12-04T09:17:22.1650276Z >>> import torch.nn as nn 2025-12-04T09:17:22.1650513Z >>> import torch.nn.utils.parametrize as P 2025-12-04T09:17:22.1650758Z >>> 2025-12-04T09:17:22.1650912Z >>> class Symmetric(nn.Module): 2025-12-04T09:17:22.1651128Z >>> def forward(self, X): 2025-12-04T09:17:22.1651381Z >>> return X.triu() + X.triu(1).T # Return a symmetric matrix 2025-12-04T09:17:22.1651632Z >>> 2025-12-04T09:17:22.1651778Z >>> def right_inverse(self, A): 2025-12-04T09:17:22.1651991Z >>> return A.triu() 2025-12-04T09:17:22.1652175Z >>> 2025-12-04T09:17:22.1652398Z >>> m = nn.Linear(5, 5) 2025-12-04T09:17:22.1652663Z >>> P.register_parametrization(m, "weight", Symmetric()) 2025-12-04T09:17:22.1653025Z >>> print(torch.allclose(m.weight, m.weight.T)) # m.weight is now symmetric 2025-12-04T09:17:22.1653313Z True 2025-12-04T09:17:22.1653467Z >>> A = torch.rand(5, 5) 2025-12-04T09:17:22.1653670Z >>> A = A + A.T # A is now symmetric 2025-12-04T09:17:22.1653949Z >>> m.weight = A # Initialize the weight to be the symmetric matrix A 2025-12-04T09:17:22.1654242Z >>> print(torch.allclose(m.weight, A)) 2025-12-04T09:17:22.1654450Z True 2025-12-04T09:17:22.1654533Z 2025-12-04T09:17:22.1654613Z >>> class RankOne(nn.Module): 2025-12-04T09:17:22.1654817Z >>> def forward(self, x, y): 2025-12-04T09:17:22.1655070Z >>> # Form a rank 1 matrix multiplying two vectors 2025-12-04T09:17:22.1655336Z >>> return x.unsqueeze(-1) @ y.unsqueeze(-2) 2025-12-04T09:17:22.1655552Z >>> 2025-12-04T09:17:22.1655711Z >>> def right_inverse(self, Z): 2025-12-04T09:17:22.1655948Z >>> # Project Z onto the rank 1 matrices 2025-12-04T09:17:22.1656206Z >>> U, S, Vh = torch.linalg.svd(Z, full_matrices=False) 2025-12-04T09:17:22.1656470Z >>> # Return rescaled singular vectors 2025-12-04T09:17:22.1656703Z >>> s0_sqrt = S[0].sqrt().unsqueeze(-1) 2025-12-04T09:17:22.1657020Z >>> return U[..., :, 0] * s0_sqrt, Vh[..., 0, :] * s0_sqrt 2025-12-04T09:17:22.1657249Z >>> 2025-12-04T09:17:22.1657454Z >>> linear_rank_one = P.register_parametrization( 2025-12-04T09:17:22.1657724Z ... nn.Linear(4, 4), "weight", RankOne() 2025-12-04T09:17:22.1657934Z ... ) 2025-12-04T09:17:22.1658159Z >>> print(torch.linalg.matrix_rank(linear_rank_one.weight).item()) 2025-12-04T09:17:22.1658435Z 1 2025-12-04T09:17:22.1658515Z 2025-12-04T09:17:22.1658575Z 2025-12-04T09:17:22.1659066Z Original Error: IndentationError('expected an indented block after function definition on line 2', ('', 3, 0, '_._ = None\n', 3, -1)) 2025-12-04T09:17:22.1659537Z 2025-12-04T09:17:22.1659595Z _._ = None 2025-12-04T09:17:22.1659736Z ^ 2025-12-04T09:17:22.3677528Z msg = Cannot scrape callname=ReduceLROnPlateau in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py line=1586. 2025-12-04T09:17:22.3678471Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T09:17:22.3678999Z Reduce learning rate when a metric has stopped improving. 2025-12-04T09:17:22.3679231Z 2025-12-04T09:17:22.3679394Z Models often benefit from reducing the learning rate by a factor 2025-12-04T09:17:22.3679779Z of 2-10 once learning stagnates. This scheduler reads a metrics 2025-12-04T09:17:22.3680172Z quantity and if no improvement is seen for a 'patience' number 2025-12-04T09:17:22.3680503Z of epochs, the learning rate is reduced. 2025-12-04T09:17:22.3680678Z 2025-12-04T09:17:22.3680759Z Args: 2025-12-04T09:17:22.3681224Z optimizer (Optimizer): Wrapped optimizer. 2025-12-04T09:17:22.3681544Z mode (str): One of `min`, `max`. In `min` mode, lr will 2025-12-04T09:17:22.3681875Z be reduced when the quantity monitored has stopped 2025-12-04T09:17:22.3682204Z decreasing; in `max` mode it will be reduced when the 2025-12-04T09:17:22.3682576Z quantity monitored has stopped increasing. Default: 'min'. 2025-12-04T09:17:22.3682951Z factor (float): Factor by which the learning rate will be 2025-12-04T09:17:22.3683284Z reduced. new_lr = lr * factor. Default: 0.1. 2025-12-04T09:17:22.3683644Z patience (int): The number of allowed epochs with no improvement after 2025-12-04T09:17:22.3684009Z which the learning rate will be reduced. 2025-12-04T09:17:22.3684373Z For example, consider the case of having no patience (`patience = 0`). 2025-12-04T09:17:22.3684925Z In the first epoch, a baseline is established and is always considered good as there's no previous baseline. 2025-12-04T09:17:22.3685492Z In the second epoch, if the performance is worse than the baseline, 2025-12-04T09:17:22.3685867Z we have what is considered an intolerable epoch. 2025-12-04T09:17:22.3686277Z Since the count of intolerable epochs (1) is greater than the patience level (0), 2025-12-04T09:17:22.3686704Z the learning rate is reduced at the end of this epoch. 2025-12-04T09:17:22.3687172Z From the third epoch onwards, the learning rate continues to be reduced at the end of each epoch 2025-12-04T09:17:22.3687754Z if the performance is worse than the baseline. If the performance improves or remains the same, 2025-12-04T09:17:22.3688175Z the learning rate is not adjusted. 2025-12-04T09:17:22.3688424Z Default: 10. 2025-12-04T09:17:22.3688718Z threshold (float): Threshold for measuring the new optimum, 2025-12-04T09:17:22.3689089Z to only focus on significant changes. Default: 1e-4. 2025-12-04T09:17:22.3689439Z threshold_mode (str): One of `rel`, `abs`. In `rel` mode, 2025-12-04T09:17:22.3689769Z dynamic_threshold = best * ( 1 + threshold ) in 'max' 2025-12-04T09:17:22.3690063Z mode or best * ( 1 - threshold ) in `min` mode. 2025-12-04T09:17:22.3690343Z In `abs` mode, dynamic_threshold = best + threshold in 2025-12-04T09:17:22.3690982Z `max` mode or best - threshold in `min` mode. Default: 'rel'. 2025-12-04T09:17:22.3691434Z cooldown (int): Number of epochs to wait before resuming 2025-12-04T09:17:22.3691874Z normal operation after lr has been reduced. Default: 0. 2025-12-04T09:17:22.3692178Z min_lr (float or list): A scalar or a list of scalars. A 2025-12-04T09:17:22.3692586Z lower bound on the learning rate of all param groups 2025-12-04T09:17:22.3692887Z or each group respectively. Default: 0. 2025-12-04T09:17:22.3693280Z eps (float): Minimal decay applied to lr. If the difference 2025-12-04T09:17:22.3693665Z between new and old lr is smaller than eps, the update is 2025-12-04T09:17:22.3693935Z ignored. Default: 1e-8. 2025-12-04T09:17:22.3694068Z 2025-12-04T09:17:22.3694130Z Example: 2025-12-04T09:17:22.3694283Z >>> # xdoctest: +SKIP 2025-12-04T09:17:22.3694596Z >>> optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9) 2025-12-04T09:17:22.3694967Z >>> scheduler = ReduceLROnPlateau(optimizer, "min") 2025-12-04T09:17:22.3695370Z >>> for epoch in range(10): 2025-12-04T09:17:22.3695568Z >>> train(...) 2025-12-04T09:17:22.3695757Z >>> val_loss = validate(...) 2025-12-04T09:17:22.3696008Z >>> # Note that step should be called after validate() 2025-12-04T09:17:22.3696256Z >>> scheduler.step(val_loss) 2025-12-04T09:17:22.3696405Z 2025-12-04T09:17:22.3696567Z .. image:: ../scripts/lr_scheduler_images/ReduceLROnPlateau.png 2025-12-04T09:17:22.3696848Z 2025-12-04T09:17:22.3697272Z Original Error: IndentationError('unexpected indent', ('', 8, 4, ' scheduler.step(val_loss)\n', 8, -1)) 2025-12-04T09:17:22.3697630Z 2025-12-04T09:17:22.3697702Z scheduler.step(val_loss) 2025-12-04T09:17:22.3697889Z ^ 2025-12-04T09:17:25.2592730Z running 894 test(s) 2025-12-04T09:17:25.2598553Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/__init__.py::typename:0, line 1111 <- wrt source file 2025-12-04T09:17:25.2606838Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/__init__.py::typename:0 2025-12-04T09:17:25.2608138Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/__init__.py::is_tensor:0, line 1142 <- wrt source file 2025-12-04T09:17:25.2611043Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/__init__.py::is_tensor:0 2025-12-04T09:17:25.2612076Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/__init__.py::is_storage:0, line 1157 <- wrt source file 2025-12-04T09:17:25.2619071Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/__init__.py::is_storage:0 2025-12-04T09:17:25.2620129Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/__init__.py::set_default_device:0, line 1247 <- wrt source file 2025-12-04T09:17:25.2621480Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/__init__.py::set_default_device:0 2025-12-04T09:17:25.2622596Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/__init__.py::set_default_tensor_type:0, line 1296 <- wrt source file 2025-12-04T09:17:25.2623809Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/__init__.py::set_default_tensor_type:0 2025-12-04T09:17:25.2624808Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/__init__.py::set_default_dtype:0, line 1333 <- wrt source file 2025-12-04T09:17:25.2627375Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/__init__.py::set_default_dtype:0 2025-12-04T09:17:25.2628703Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/__init__.py::use_deterministic_algorithms:0, line 1497 <- wrt source file 2025-12-04T09:17:25.2630245Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/__init__.py::use_deterministic_algorithms:0 2025-12-04T09:17:25.2631056Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/__init__.py::compile:0, line 2655 <- wrt source file 2025-12-04T09:17:25.2631827Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/__init__.py::compile:0 2025-12-04T09:17:25.2632886Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/__init__.py::_is_device_backend_autoload_enabled:0, line 2963 <- wrt source file 2025-12-04T09:17:25.2633910Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/__init__.py::_is_device_backend_autoload_enabled:0 2025-12-04T09:17:25.2634993Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_C.cpython-310-x86_64-linux-gnu.so::Generator:0, line 15 <- wrt source file 2025-12-04T09:17:25.2636055Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_C.cpython-310-x86_64-linux-gnu.so::Generator:0 2025-12-04T09:17:25.2637102Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_C.cpython-310-x86_64-linux-gnu.so::_LinAlgError:0, line 5 <- wrt source file 2025-12-04T09:17:25.2638177Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_C.cpython-310-x86_64-linux-gnu.so::_LinAlgError:0 2025-12-04T09:17:25.2639113Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_custom_ops.py::custom_op:0, line 55 <- wrt source file 2025-12-04T09:17:25.2639918Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_custom_ops.py::custom_op:0 2025-12-04T09:17:25.2640662Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_custom_ops.py::impl:0, line 138 <- wrt source file 2025-12-04T09:17:25.2641392Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_custom_ops.py::impl:0 2025-12-04T09:17:25.2642138Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_custom_ops.py::impl_abstract:0, line 208 <- wrt source file 2025-12-04T09:17:25.3171019Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_custom_ops.py::impl_abstract:0 2025-12-04T09:17:25.3172688Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_namedtensor_internals.py::update_names:0, line 118 <- wrt source file 2025-12-04T09:17:25.3175047Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_namedtensor_internals.py::update_names:0 2025-12-04T09:17:25.3176477Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_tensor.py::Tensor.register_hook:0, line 681 <- wrt source file 2025-12-04T09:17:25.3187833Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_tensor.py::Tensor.register_hook:0 2025-12-04T09:17:25.3189039Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_tensor.py::Tensor.register_post_accumulate_grad_hook:0, line 738 <- wrt source file 2025-12-04T09:17:25.3208317Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_tensor.py::Tensor.register_post_accumulate_grad_hook:0 2025-12-04T09:17:25.3209649Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_tensor.py::Tensor.refine_names:0, line 1374 <- wrt source file 2025-12-04T09:17:25.3266891Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_tensor.py::Tensor.refine_names:0 2025-12-04T09:17:25.3268386Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_tensor.py::Tensor.align_to:0, line 1419 <- wrt source file 2025-12-04T09:17:25.3271755Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_tensor.py::Tensor.align_to:0 2025-12-04T09:17:25.3272914Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_tensor.py::Tensor.rename:0, line 1492 <- wrt source file 2025-12-04T09:17:25.3279280Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_tensor.py::Tensor.rename:0 2025-12-04T09:17:25.3280498Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_tensor.py::Tensor.to_sparse_coo:0, line 1522 <- wrt source file 2025-12-04T09:17:25.3289514Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_tensor.py::Tensor.to_sparse_coo:0 2025-12-04T09:17:25.3290669Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_tensor.py::Tensor.dim_order:0, line 1554 <- wrt source file 2025-12-04T09:17:25.3308525Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_tensor.py::Tensor.dim_order:0 2025-12-04T09:17:25.3309758Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_tensor_str.py::set_printoptions:0, line 53 <- wrt source file 2025-12-04T09:17:25.3324361Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_tensor_str.py::set_printoptions:0 2025-12-04T09:17:25.3325633Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::broadcast_tensors:0, line 64 <- wrt source file 2025-12-04T09:17:25.3331696Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::broadcast_tensors:0 2025-12-04T09:17:25.3333079Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::broadcast_shapes:0, line 92 <- wrt source file 2025-12-04T09:17:25.3335892Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::broadcast_shapes:0 2025-12-04T09:17:25.3337090Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::split:0, line 144 <- wrt source file 2025-12-04T09:17:25.3348962Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::split:0 2025-12-04T09:17:25.3350057Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::einsum:0, line 258 <- wrt source file 2025-12-04T09:17:25.3367397Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::einsum:0 2025-12-04T09:17:25.3368500Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::meshgrid:0, line 450 <- wrt source file 2025-12-04T09:17:25.3406920Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::meshgrid:0 2025-12-04T09:17:25.3408107Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::_unique_impl:0, line 835 <- wrt source file 2025-12-04T09:17:25.3450906Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::_unique_impl:0 2025-12-04T09:17:25.3452220Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::_unique_consecutive_impl:0, line 992 <- wrt source file 2025-12-04T09:17:25.3463959Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::_unique_consecutive_impl:0 2025-12-04T09:17:25.3465257Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::tensordot:0, line 1267 <- wrt source file 2025-12-04T09:17:25.3475006Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::tensordot:0 2025-12-04T09:17:25.3476233Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::cartesian_prod:0, line 1351 <- wrt source file 2025-12-04T09:17:25.3482821Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::cartesian_prod:0 2025-12-04T09:17:25.3483831Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::block_diag:0, line 1385 <- wrt source file 2025-12-04T09:17:25.3493191Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::block_diag:0 2025-12-04T09:17:25.3494343Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::cdist:0, line 1441 <- wrt source file 2025-12-04T09:17:25.3507584Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::cdist:0 2025-12-04T09:17:25.3508560Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::atleast_1d:0, line 1482 <- wrt source file 2025-12-04T09:17:25.3525001Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::atleast_1d:0 2025-12-04T09:17:25.3526053Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::atleast_2d:0, line 1520 <- wrt source file 2025-12-04T09:17:25.3543554Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::atleast_2d:0 2025-12-04T09:17:25.3544745Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::atleast_3d:0, line 1560 <- wrt source file 2025-12-04T09:17:25.3566337Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::atleast_3d:0 2025-12-04T09:17:25.3567528Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::norm:0, line 1735 <- wrt source file 2025-12-04T09:17:25.3599508Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::norm:0 2025-12-04T09:17:25.3600628Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::unravel_index:0, line 1905 <- wrt source file 2025-12-04T09:17:25.3629023Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::unravel_index:0 2025-12-04T09:17:25.3630264Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::chain_matmul:0, line 2005 <- wrt source file 2025-12-04T09:17:25.3631505Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::chain_matmul:0 2025-12-04T09:17:25.3632661Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::_lu_impl:0, line 2106 <- wrt source file 2025-12-04T09:17:25.3633814Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::_lu_impl:0 2025-12-04T09:17:25.3634888Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/hub.py::list:0, line 477 <- wrt source file 2025-12-04T09:17:25.3635929Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/hub.py::list:0 2025-12-04T09:17:25.3636952Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/hub.py::help:0, line 537 <- wrt source file 2025-12-04T09:17:25.3638004Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/hub.py::help:0 2025-12-04T09:17:25.3639020Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/hub.py::load:0, line 628 <- wrt source file 2025-12-04T09:17:25.3640061Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/hub.py::load:0 2025-12-04T09:17:25.3640973Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/hub.py::_load_local:0, line 676 <- wrt source file 2025-12-04T09:17:25.3642039Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/hub.py::_load_local:0 2025-12-04T09:17:25.3643030Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/hub.py::download_url_to_file:0, line 711 <- wrt source file 2025-12-04T09:17:25.3644078Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/hub.py::download_url_to_file:0 2025-12-04T09:17:25.3645173Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/hub.py::load_state_dict_from_url:0, line 852 <- wrt source file 2025-12-04T09:17:25.3646301Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/hub.py::load_state_dict_from_url:0 2025-12-04T09:17:25.3647333Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/library.py::Library.define:0, line 145 <- wrt source file 2025-12-04T09:17:25.3648379Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/library.py::Library.define:0 2025-12-04T09:17:25.3649468Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/library.py::Library._impl_with_aoti_compile:0, line 239 <- wrt source file 2025-12-04T09:17:25.3650660Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/library.py::Library._impl_with_aoti_compile:0 2025-12-04T09:17:25.3651498Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/library.py::Library.impl:0, line 300 <- wrt source file 2025-12-04T09:17:25.3654988Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/library.py::Library.impl:0 2025-12-04T09:17:25.3655780Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/library.py::define:0, line 521 <- wrt source file 2025-12-04T09:17:25.3666477Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/library.py::define:0 2025-12-04T09:17:25.3667402Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/library.py::impl:0, line 627 <- wrt source file 2025-12-04T09:17:25.3681529Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/library.py::impl:0 2025-12-04T09:17:25.3682516Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/library.py::register_kernel:0, line 809 <- wrt source file 2025-12-04T09:17:25.3683524Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/library.py::register_kernel:0 2025-12-04T09:17:25.3684496Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/library.py::register_autocast:0, line 877 <- wrt source file 2025-12-04T09:17:25.3685639Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/library.py::register_autocast:0 2025-12-04T09:17:25.3686617Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/library.py::register_autograd:0, line 1164 <- wrt source file 2025-12-04T09:17:25.3844567Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/library.py::register_autograd:0 2025-12-04T09:17:25.3845992Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/library.py::register_torch_dispatch:0, line 1280 <- wrt source file 2025-12-04T09:17:25.3917425Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/library.py::register_torch_dispatch:0 2025-12-04T09:17:25.3918817Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/library.py::register_vmap:0, line 1369 <- wrt source file 2025-12-04T09:17:25.4066197Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/library.py::register_vmap:0 2025-12-04T09:17:25.4067644Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/library.py::opcheck:0, line 1694 <- wrt source file 2025-12-04T09:17:25.4069207Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/library.py::opcheck:0 2025-12-04T09:17:25.4070501Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/overrides.py::get_ignored_functions:0, line 117 <- wrt source file 2025-12-04T09:17:25.4072942Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/overrides.py::get_ignored_functions:0 2025-12-04T09:17:25.4074328Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/overrides.py::get_testing_overrides:0, line 435 <- wrt source file 2025-12-04T09:17:25.4106073Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/overrides.py::get_testing_overrides:0 2025-12-04T09:17:25.4107136Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/overrides.py::wrap_torch_function:0, line 1589 <- wrt source file 2025-12-04T09:17:25.4109878Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/overrides.py::wrap_torch_function:0 2025-12-04T09:17:25.4110930Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/overrides.py::handle_torch_function:0, line 1725 <- wrt source file 2025-12-04T09:17:25.4113132Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/overrides.py::handle_torch_function:0 2025-12-04T09:17:25.4114342Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/overrides.py::is_tensor_method_or_property:0, line 1974 <- wrt source file 2025-12-04T09:17:25.4140658Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/overrides.py::is_tensor_method_or_property:0 2025-12-04T09:17:25.4141745Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/overrides.py::is_tensor_like:0, line 1993 <- wrt source file 2025-12-04T09:17:25.4149251Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/overrides.py::is_tensor_like:0 2025-12-04T09:17:25.4150724Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/quasirandom.py::SobolEngine:0, line 39 <- wrt source file 2025-12-04T09:17:25.4151915Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/quasirandom.py::SobolEngine:0 2025-12-04T09:17:25.4153205Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/serialization.py::add_safe_globals:0, line 300 <- wrt source file 2025-12-04T09:17:25.4154589Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/serialization.py::add_safe_globals:0 2025-12-04T09:17:25.4155628Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/serialization.py::safe_globals:0, line 325 <- wrt source file 2025-12-04T09:17:25.4156645Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/serialization.py::safe_globals:0 2025-12-04T09:17:25.4157618Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/serialization.py::skip_data:0, line 401 <- wrt source file 2025-12-04T09:17:25.4158880Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/serialization.py::skip_data:0 2025-12-04T09:17:25.4160187Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/serialization.py::register_package:0, line 473 <- wrt source file 2025-12-04T09:17:25.4161173Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/serialization.py::register_package:0 2025-12-04T09:17:25.4162409Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/serialization.py::save:0, line 960 <- wrt source file 2025-12-04T09:17:25.4163319Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/serialization.py::save:0 2025-12-04T09:17:25.4164450Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/serialization.py::load:0, line 1379 <- wrt source file 2025-12-04T09:17:25.4165779Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/serialization.py::load:0 2025-12-04T09:17:25.4166682Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/torch_version.py::TorchVersion:0, line 19 <- wrt source file 2025-12-04T09:17:25.4167754Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/torch_version.py::TorchVersion:0 2025-12-04T09:17:25.4169088Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/__init__.py::list_mode_options:0, line 349 <- wrt source file 2025-12-04T09:17:25.4170114Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/__init__.py::list_mode_options:0 2025-12-04T09:17:25.4170964Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/__init__.py::list_options:0, line 388 <- wrt source file 2025-12-04T09:17:25.4183398Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/__init__.py::list_options:0 2025-12-04T09:17:25.4184570Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_prims_common/__init__.py::compute_required_storage_length:0, line 1911 <- wrt source file 2025-12-04T09:17:25.4189510Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_prims_common/__init__.py::compute_required_storage_length:0 2025-12-04T09:17:25.4190757Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/accelerator/__init__.py::current_accelerator:0, line 117 <- wrt source file 2025-12-04T09:17:25.5204417Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/accelerator/__init__.py::current_accelerator:0 2025-12-04T09:17:25.5206424Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/accelerator/__init__.py::get_device_capability:0, line 171 <- wrt source file 2025-12-04T09:17:25.5208013Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/accelerator/__init__.py::get_device_capability:0 2025-12-04T09:17:25.5209471Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/accelerator/__init__.py::device_index:0, line 276 <- wrt source file 2025-12-04T09:17:25.5210733Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/accelerator/__init__.py::device_index:0 2025-12-04T09:17:25.5211829Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/compiler/__init__.py::allow_in_graph:0, line 130 <- wrt source file 2025-12-04T09:17:25.5213083Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/compiler/__init__.py::allow_in_graph:0 2025-12-04T09:17:25.5214218Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/compiler/__init__.py::substitute_in_graph:0, line 186 <- wrt source file 2025-12-04T09:17:25.7543582Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/compiler/__init__.py::substitute_in_graph:0 2025-12-04T09:17:25.7545206Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/compiler/__init__.py::wrap_numpy:0, line 416 <- wrt source file 2025-12-04T09:17:25.7546756Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/compiler/__init__.py::wrap_numpy:0 2025-12-04T09:17:25.7548334Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/compiler/__init__.py::is_compiling:0, line 448 <- wrt source file 2025-12-04T09:17:25.7549937Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/compiler/__init__.py::is_compiling:0 2025-12-04T09:17:25.7551949Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/compiler/__init__.py::is_dynamo_compiling:0, line 469 <- wrt source file 2025-12-04T09:17:25.7554027Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/compiler/__init__.py::is_dynamo_compiling:0 2025-12-04T09:17:25.7555645Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/compiler/__init__.py::is_exporting:0, line 487 <- wrt source file 2025-12-04T09:17:25.7559151Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/compiler/__init__.py::is_exporting:0 2025-12-04T09:17:25.7560948Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/compiler/__init__.py::save_cache_artifacts:0, line 502 <- wrt source file 2025-12-04T09:17:25.7562892Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/compiler/__init__.py::save_cache_artifacts:0 2025-12-04T09:17:25.7564542Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/compiler/__init__.py::load_cache_artifacts:0, line 522 <- wrt source file 2025-12-04T09:17:25.7566244Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/compiler/__init__.py::load_cache_artifacts:0 2025-12-04T09:17:25.7567859Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/__init__.py::_compile_kernel:0, line 1788 <- wrt source file 2025-12-04T09:17:25.7569627Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/__init__.py::_compile_kernel:0 2025-12-04T09:17:25.7571144Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/export/__init__.py::save:0, line 349 <- wrt source file 2025-12-04T09:17:25.7572761Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/export/__init__.py::save:0 2025-12-04T09:17:25.7574204Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/export/__init__.py::load:0, line 422 <- wrt source file 2025-12-04T09:17:25.7575707Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/export/__init__.py::load:0 2025-12-04T09:17:25.7577168Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/export/__init__.py::register_dataclass:0, line 581 <- wrt source file 2025-12-04T09:17:25.7578742Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/export/__init__.py::register_dataclass:0 2025-12-04T09:17:25.7580244Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/futures/__init__.py::Future.then:0, line 152 <- wrt source file 2025-12-04T09:17:25.7581745Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/futures/__init__.py::Future.then:0 2025-12-04T09:17:25.7583214Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/futures/__init__.py::Future.add_done_callback:0, line 201 <- wrt source file 2025-12-04T09:17:25.7584879Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/futures/__init__.py::Future.add_done_callback:0 2025-12-04T09:17:25.7586472Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/futures/__init__.py::Future.set_result:0, line 235 <- wrt source file 2025-12-04T09:17:25.7588035Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/futures/__init__.py::Future.set_result:0 2025-12-04T09:17:25.7589582Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/futures/__init__.py::Future.set_exception:0, line 265 <- wrt source file 2025-12-04T09:17:25.7591195Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/futures/__init__.py::Future.set_exception:0 2025-12-04T09:17:25.7592861Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/futures/__init__.py::collect_all:0, line 299 <- wrt source file 2025-12-04T09:17:25.7594366Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/futures/__init__.py::collect_all:0 2025-12-04T09:17:25.7595775Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/jit/__init__.py::annotate:0, line 147 <- wrt source file 2025-12-04T09:17:25.7597310Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/jit/__init__.py::annotate:0 2025-12-04T09:17:25.7598967Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/monitor/__init__.py::TensorboardEventHandler:0, line 22 <- wrt source file 2025-12-04T09:17:25.7602920Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/monitor/__init__.py::TensorboardEventHandler:0 2025-12-04T09:17:25.7604552Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/mps/__init__.py::compile_shader:0, line 148 <- wrt source file 2025-12-04T09:17:25.7606121Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/mps/__init__.py::compile_shader:0 2025-12-04T09:17:25.7607538Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nested/__init__.py::as_nested_tensor:0, line 61 <- wrt source file 2025-12-04T09:17:25.7627921Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nested/__init__.py::as_nested_tensor:0 2025-12-04T09:17:25.7629644Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nested/__init__.py::nested_tensor:0, line 240 <- wrt source file 2025-12-04T09:17:25.7634529Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nested/__init__.py::nested_tensor:0 2025-12-04T09:17:25.7636043Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nested/__init__.py::narrow:0, line 315 <- wrt source file 2025-12-04T09:17:25.7684127Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nested/__init__.py::narrow:0 2025-12-04T09:17:25.7685755Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nested/__init__.py::nested_tensor_from_jagged:0, line 405 <- wrt source file 2025-12-04T09:17:25.7692933Z W1204 09:17:25.768000 1836 site-packages/torch/fx/_symbolic_trace.py:53] is_fx_tracing will return true for both fx.symbolic_trace and torch.export. Please use is_fx_tracing_symbolic_tracing() for specifically fx.symbolic_trace or torch.compiler.is_compiling() for specifically torch.export/compile. 2025-12-04T09:17:25.7711074Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nested/__init__.py::nested_tensor_from_jagged:0 2025-12-04T09:17:25.7712762Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nested/__init__.py::masked_select:0, line 481 <- wrt source file 2025-12-04T09:17:25.7730274Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nested/__init__.py::masked_select:0 2025-12-04T09:17:25.7731753Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/sparse/__init__.py::sum:0, line 223 <- wrt source file 2025-12-04T09:17:25.7743263Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/sparse/__init__.py::sum:0 2025-12-04T09:17:25.7744520Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/sparse/__init__.py::check_sparse_tensor_invariants:0, line 475 <- wrt source file 2025-12-04T09:17:25.7752967Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/sparse/__init__.py::check_sparse_tensor_invariants:0 2025-12-04T09:17:25.7754222Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/sparse/__init__.py::as_sparse_gradcheck:0, line 561 <- wrt source file 2025-12-04T09:17:25.7797844Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/sparse/__init__.py::as_sparse_gradcheck:0 2025-12-04T09:17:25.7799106Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/decorators.py::substitute_in_graph:0, line 361 <- wrt source file 2025-12-04T09:17:25.7802139Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/decorators.py::substitute_in_graph:0 2025-12-04T09:17:25.7803553Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/variables/base.py::VariableTracker.python_type:0, line 328 <- wrt source file 2025-12-04T09:17:25.7804968Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/variables/base.py::VariableTracker.python_type:0 2025-12-04T09:17:25.7806506Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/variables/higher_order_ops.py::speculate_subgraph_with_auto_output_flattening:0, line 1316 <- wrt source file 2025-12-04T09:17:25.7808187Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/variables/higher_order_ops.py::speculate_subgraph_with_auto_output_flattening:0 2025-12-04T09:17:25.7809674Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_export/utils.py::register_module_as_pytree_input_node:0, line 1441 <- wrt source file 2025-12-04T09:17:25.7811080Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_export/utils.py::register_module_as_pytree_input_node:0 2025-12-04T09:17:25.7812574Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_export/wrappers.py::mark_subclass_constructor_exportable_experimental:0, line 194 <- wrt source file 2025-12-04T09:17:25.7814025Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_export/wrappers.py::mark_subclass_constructor_exportable_experimental:0 2025-12-04T09:17:25.7815320Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_export/wrappers.py::allow_in_pre_dispatch_graph:0, line 262 <- wrt source file 2025-12-04T09:17:25.7816553Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_export/wrappers.py::allow_in_pre_dispatch_graph:0 2025-12-04T09:17:25.7817731Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py::aot_function:0, line 771 <- wrt source file 2025-12-04T09:17:25.8098239Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py::aot_function:0 2025-12-04T09:17:25.8099812Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/benchmark_utils.py::benchmark_utilization:0, line 184 <- wrt source file 2025-12-04T09:17:25.8101383Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/benchmark_utils.py::benchmark_utilization:0 2025-12-04T09:17:25.8102751Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/eager_transforms.py::vjp:0, line 234 <- wrt source file 2025-12-04T09:17:25.8136888Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/eager_transforms.py::vjp:0 2025-12-04T09:17:25.8137997Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/eager_transforms.py::jacrev:0, line 476 <- wrt source file 2025-12-04T09:17:25.8195934Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/eager_transforms.py::jacrev:0 2025-12-04T09:17:25.8197191Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/eager_transforms.py::jvp:0, line 1024 <- wrt source file 2025-12-04T09:17:26.0476176Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/eager_transforms.py::jvp:0 2025-12-04T09:17:26.0477696Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/eager_transforms.py::jacfwd:0, line 1182 <- wrt source file 2025-12-04T09:17:26.0538463Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/eager_transforms.py::jacfwd:0 2025-12-04T09:17:26.0540305Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/eager_transforms.py::hessian:0, line 1342 <- wrt source file 2025-12-04T09:17:26.0557216Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/eager_transforms.py::hessian:0 2025-12-04T09:17:26.0558736Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/eager_transforms.py::functionalize:0, line 1506 <- wrt source file 2025-12-04T09:17:26.0562468Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/eager_transforms.py::functionalize:0 2025-12-04T09:17:26.0563987Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/eager_transforms.py::linearize:0, line 1705 <- wrt source file 2025-12-04T09:17:26.0725674Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/eager_transforms.py::linearize:0 2025-12-04T09:17:26.0727387Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/functional_call.py::functional_call:0, line 36 <- wrt source file 2025-12-04T09:17:26.0730723Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/functional_call.py::functional_call:0 2025-12-04T09:17:26.0732065Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/fx_minifier.py::minifier:0, line 194 <- wrt source file 2025-12-04T09:17:26.0733451Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/fx_minifier.py::minifier:0 2025-12-04T09:17:26.0734857Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/schemas.py::CompilerWrapper.post_compile:0, line 1111 <- wrt source file 2025-12-04T09:17:26.0736289Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/schemas.py::CompilerWrapper.post_compile:0 2025-12-04T09:17:26.0737683Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/schemas.py::InductorWrapper.post_compile:0, line 1166 <- wrt source file 2025-12-04T09:17:26.0739091Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/schemas.py::InductorWrapper.post_compile:0 2025-12-04T09:17:26.0740403Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_higher_order_ops/associative_scan.py::associative_scan:0, line 183 <- wrt source file 2025-12-04T09:17:26.0741384Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_higher_order_ops/associative_scan.py::associative_scan:0 2025-12-04T09:17:26.0742373Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_higher_order_ops/associative_scan.py::generic_associative_scan:0, line 319 <- wrt source file 2025-12-04T09:17:26.0743387Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_higher_order_ops/associative_scan.py::generic_associative_scan:0 2025-12-04T09:17:26.0744257Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_higher_order_ops/cond.py::cond:0, line 139 <- wrt source file 2025-12-04T09:17:26.0745267Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_higher_order_ops/cond.py::cond:0 2025-12-04T09:17:26.0746109Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_higher_order_ops/flat_apply.py::FlatApply.__call__:0, line 80 <- wrt source file 2025-12-04T09:17:26.0747040Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_higher_order_ops/flat_apply.py::FlatApply.__call__:0 2025-12-04T09:17:26.0748180Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_higher_order_ops/map.py::map:0, line 80 <- wrt source file 2025-12-04T09:17:26.0749241Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_higher_order_ops/map.py::map:0 2025-12-04T09:17:26.0750518Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_higher_order_ops/partitioner.py::HopPartitionedGraph._reorder_fw_output:0, line 133 <- wrt source file 2025-12-04T09:17:26.0751767Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_higher_order_ops/partitioner.py::HopPartitionedGraph._reorder_fw_output:0 2025-12-04T09:17:26.0752993Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_higher_order_ops/scan.py::scan:0, line 130 <- wrt source file 2025-12-04T09:17:26.0754043Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_higher_order_ops/scan.py::scan:0 2025-12-04T09:17:26.0755097Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/codecache.py::WritableTempFile:0, line 385 <- wrt source file 2025-12-04T09:17:26.0756165Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/codecache.py::WritableTempFile:0 2025-12-04T09:17:26.0757357Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/cpp_builder.py::get_name_and_dir_from_output_file_path:0, line 1845 <- wrt source file 2025-12-04T09:17:26.0758430Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/cpp_builder.py::get_name_and_dir_from_output_file_path:0 2025-12-04T09:17:26.0759400Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py::add_preprocessing_fn:0, line 4328 <- wrt source file 2025-12-04T09:17:26.0760366Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py::add_preprocessing_fn:0 2025-12-04T09:17:26.0761282Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/fx_passes/graph_view.py::_clean_stack_name:0, line 100 <- wrt source file 2025-12-04T09:17:26.0762536Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/fx_passes/graph_view.py::_clean_stack_name:0 2025-12-04T09:17:26.0763471Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/kernel/custom_op.py::CustomOpConfig:0, line 56 <- wrt source file 2025-12-04T09:17:26.0764381Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/kernel/custom_op.py::CustomOpConfig:0 2025-12-04T09:17:26.0765338Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/kernel/custom_op.py::register_custom_op_autotuning:0, line 423 <- wrt source file 2025-12-04T09:17:26.0766349Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/kernel/custom_op.py::register_custom_op_autotuning:0 2025-12-04T09:17:26.0767336Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/caching/locks.py::_acquire_lock_with_timeout:0, line 69 <- wrt source file 2025-12-04T09:17:26.0768328Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/caching/locks.py::_acquire_lock_with_timeout:0 2025-12-04T09:17:26.0769448Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/caching/locks.py::_unsafe_acquire_lock_with_timeout:0, line 105 <- wrt source file 2025-12-04T09:17:26.0770500Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/caching/locks.py::_unsafe_acquire_lock_with_timeout:0 2025-12-04T09:17:26.0771558Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/caching/locks.py::_acquire_flock_with_timeout:0, line 142 <- wrt source file 2025-12-04T09:17:26.0772778Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/caching/locks.py::_acquire_flock_with_timeout:0 2025-12-04T09:17:26.0773782Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/caching/locks.py::_unsafe_acquire_flock_with_timeout:0, line 179 <- wrt source file 2025-12-04T09:17:26.0774832Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/caching/locks.py::_unsafe_acquire_flock_with_timeout:0 2025-12-04T09:17:26.0775863Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/template_heuristics/registry.py::register_template_heuristic:0, line 54 <- wrt source file 2025-12-04T09:17:26.0776921Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/template_heuristics/registry.py::register_template_heuristic:0 2025-12-04T09:17:26.0777884Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_library/custom_ops.py::custom_op:0, line 101 <- wrt source file 2025-12-04T09:17:26.1115949Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_library/custom_ops.py::custom_op:0 2025-12-04T09:17:26.1117499Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_library/custom_ops.py::CustomOpDef.set_kernel_enabled:0, line 241 <- wrt source file 2025-12-04T09:17:26.1199091Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_library/custom_ops.py::CustomOpDef.set_kernel_enabled:0 2025-12-04T09:17:26.1200907Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_library/custom_ops.py::CustomOpDef.register_kernel:0, line 310 <- wrt source file 2025-12-04T09:17:26.1202710Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_library/custom_ops.py::CustomOpDef.register_kernel:0 2025-12-04T09:17:26.1204450Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_library/custom_ops.py::CustomOpDef.register_autograd:0, line 549 <- wrt source file 2025-12-04T09:17:26.1358939Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_library/custom_ops.py::CustomOpDef.register_autograd:0 2025-12-04T09:17:26.1360543Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_library/custom_ops.py::CustomOpDef.register_vmap:0, line 724 <- wrt source file 2025-12-04T09:17:26.1518639Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_library/custom_ops.py::CustomOpDef.register_vmap:0 2025-12-04T09:17:26.1520115Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_library/custom_ops.py::CustomOpDef.register_autocast:0, line 810 <- wrt source file 2025-12-04T09:17:26.1521624Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_library/custom_ops.py::CustomOpDef.register_autocast:0 2025-12-04T09:17:26.1523052Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_library/fake_class_registry.py::register_fake_class:0, line 273 <- wrt source file 2025-12-04T09:17:26.1524489Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_library/fake_class_registry.py::register_fake_class:0 2025-12-04T09:17:26.1526056Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_library/fake_impl.py::FakeImplCtx.new_dynamic_size:0, line 175 <- wrt source file 2025-12-04T09:17:26.1592946Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_library/fake_impl.py::FakeImplCtx.new_dynamic_size:0 2025-12-04T09:17:26.1594624Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_library/infer_schema.py::infer_schema:0, line 53 <- wrt source file 2025-12-04T09:17:26.1601281Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_library/infer_schema.py::infer_schema:0 2025-12-04T09:17:26.1602932Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_library/triton.py::triton_op:0, line 136 <- wrt source file 2025-12-04T09:17:26.1604421Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_library/triton.py::triton_op:0 2025-12-04T09:17:26.1605771Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_library/triton.py::wrap_triton:0, line 307 <- wrt source file 2025-12-04T09:17:26.1607139Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_library/triton.py::wrap_triton:0 2025-12-04T09:17:26.1608484Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_logging/_internal.py::set_logs:0, line 460 <- wrt source file 2025-12-04T09:17:26.1609962Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_logging/_internal.py::set_logs:0 2025-12-04T09:17:26.1611093Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_numpy/testing/utils.py::assert_equal:0, line 171 <- wrt source file 2025-12-04T09:17:26.1642054Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_numpy/testing/utils.py::assert_equal:0 2025-12-04T09:17:26.1643274Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_numpy/testing/utils.py::print_assert_equal:0, line 302 <- wrt source file 2025-12-04T09:17:26.1644492Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_numpy/testing/utils.py::print_assert_equal:0 2025-12-04T09:17:26.1645671Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_numpy/testing/utils.py::assert_almost_equal:0, line 375 <- wrt source file 2025-12-04T09:17:26.1688221Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_numpy/testing/utils.py::assert_almost_equal:0 2025-12-04T09:17:26.1689754Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_numpy/testing/utils.py::assert_approx_equal:0, line 496 <- wrt source file 2025-12-04T09:17:26.1691780Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_numpy/testing/utils.py::assert_approx_equal:0 2025-12-04T09:17:26.1693383Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_numpy/testing/utils.py::assert_array_equal:0, line 793 <- wrt source file 2025-12-04T09:17:26.1749655Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_numpy/testing/utils.py::assert_array_equal:0 2025-12-04T09:17:26.1751183Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_numpy/testing/utils.py::assert_array_almost_equal:0, line 899 <- wrt source file 2025-12-04T09:17:26.1807307Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_numpy/testing/utils.py::assert_array_almost_equal:0 2025-12-04T09:17:26.1808816Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_numpy/testing/utils.py::assert_array_less:0, line 1008 <- wrt source file 2025-12-04T09:17:26.1857786Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_numpy/testing/utils.py::assert_array_less:0 2025-12-04T09:17:26.1859277Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_numpy/testing/utils.py::assert_string_equal:0, line 1073 <- wrt source file 2025-12-04T09:17:26.1860779Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_numpy/testing/utils.py::assert_string_equal:0 2025-12-04T09:17:26.1862691Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_numpy/testing/utils.py::assert_allclose:0, line 1294 <- wrt source file 2025-12-04T09:17:26.1877640Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_numpy/testing/utils.py::assert_allclose:0 2025-12-04T09:17:26.1878919Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_numpy/testing/utils.py::assert_array_almost_equal_nulp:0, line 1360 <- wrt source file 2025-12-04T09:17:26.1882574Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_numpy/testing/utils.py::assert_array_almost_equal_nulp:0 2025-12-04T09:17:26.1883826Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_numpy/testing/utils.py::assert_array_max_ulp:0, line 1423 <- wrt source file 2025-12-04T09:17:26.1887752Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_numpy/testing/utils.py::assert_array_max_ulp:0 2025-12-04T09:17:26.1889210Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_numpy/testing/utils.py::nulp_diff:0, line 1468 <- wrt source file 2025-12-04T09:17:26.1890375Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_numpy/testing/utils.py::nulp_diff:0 2025-12-04T09:17:26.1891492Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_numpy/testing/utils.py::assert_warns:0, line 1578 <- wrt source file 2025-12-04T09:17:26.1895764Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_numpy/testing/utils.py::assert_warns:0 2025-12-04T09:17:26.1897025Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_numpy/testing/utils.py::clear_and_catch_warnings:0, line 1881 <- wrt source file 2025-12-04T09:17:26.1898813Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_numpy/testing/utils.py::clear_and_catch_warnings:0 2025-12-04T09:17:26.1900104Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_prims/context.py::TorchRefsMode:0, line 95 <- wrt source file 2025-12-04T09:17:26.1901230Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_prims/context.py::TorchRefsMode:0 2025-12-04T09:17:26.1902455Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_subclasses/complex_tensor/_ops/common.py::is_complex_tensor:0, line 47 <- wrt source file 2025-12-04T09:17:26.1903747Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_subclasses/complex_tensor/_ops/common.py::is_complex_tensor:0 2025-12-04T09:17:26.1904885Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/amp/grad_scaler.py::GradScaler:0, line 64 <- wrt source file 2025-12-04T09:17:26.1905936Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/amp/grad_scaler.py::GradScaler:0 2025-12-04T09:17:26.1907064Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/intrinsic/qat/modules/linear_relu.py::LinearReLU:0, line 34 <- wrt source file 2025-12-04T09:17:26.1908330Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/intrinsic/qat/modules/linear_relu.py::LinearReLU:0 2025-12-04T09:17:26.1909626Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/intrinsic/quantized/dynamic/modules/linear_relu.py::LinearReLU:0, line 24 <- wrt source file 2025-12-04T09:17:26.1910917Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/intrinsic/quantized/dynamic/modules/linear_relu.py::LinearReLU:0 2025-12-04T09:17:26.1911958Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/intrinsic/quantized/modules/linear_relu.py::LinearReLU:0, line 25 <- wrt source file 2025-12-04T09:17:26.1913033Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/intrinsic/quantized/modules/linear_relu.py::LinearReLU:0 2025-12-04T09:17:26.1914109Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/intrinsic/quantized/modules/linear_relu.py::LinearLeakyReLU:0, line 67 <- wrt source file 2025-12-04T09:17:26.1915198Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/intrinsic/quantized/modules/linear_relu.py::LinearLeakyReLU:0 2025-12-04T09:17:26.1916243Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/intrinsic/quantized/modules/linear_relu.py::LinearTanh:0, line 142 <- wrt source file 2025-12-04T09:17:26.1917282Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/intrinsic/quantized/modules/linear_relu.py::LinearTanh:0 2025-12-04T09:17:26.1919482Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantizable/modules/rnn.py::LSTMCell:0, line 29 <- wrt source file 2025-12-04T09:17:26.1928976Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantizable/modules/rnn.py::LSTMCell:0 2025-12-04T09:17:26.1930693Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantizable/modules/rnn.py::LSTM:0, line 413 <- wrt source file 2025-12-04T09:17:26.1959054Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantizable/modules/rnn.py::LSTM:0 2025-12-04T09:17:26.1960751Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/functional.py::conv1d:0, line 210 <- wrt source file 2025-12-04T09:17:26.1962751Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/functional.py::conv1d:0 2025-12-04T09:17:26.1964395Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/functional.py::conv2d:0, line 282 <- wrt source file 2025-12-04T09:17:26.1966101Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/functional.py::conv2d:0 2025-12-04T09:17:26.1967760Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/functional.py::conv3d:0, line 358 <- wrt source file 2025-12-04T09:17:26.1969492Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/functional.py::conv3d:0 2025-12-04T09:17:26.1971158Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/modules/__init__.py::Quantize:0, line 95 <- wrt source file 2025-12-04T09:17:26.1973094Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/modules/__init__.py::Quantize:0 2025-12-04T09:17:26.1974877Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/modules/__init__.py::DeQuantize:0, line 145 <- wrt source file 2025-12-04T09:17:26.1978016Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/modules/__init__.py::DeQuantize:0 2025-12-04T09:17:26.1979829Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/dynamic/modules/conv.py::Conv1d:0, line 43 <- wrt source file 2025-12-04T09:17:26.1981921Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/dynamic/modules/conv.py::Conv1d:0 2025-12-04T09:17:26.1983709Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/dynamic/modules/conv.py::Conv2d:0, line 126 <- wrt source file 2025-12-04T09:17:26.1985573Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/dynamic/modules/conv.py::Conv2d:0 2025-12-04T09:17:26.1987487Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/dynamic/modules/conv.py::Conv3d:0, line 212 <- wrt source file 2025-12-04T09:17:26.1989457Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/dynamic/modules/conv.py::Conv3d:0 2025-12-04T09:17:26.1991325Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/dynamic/modules/conv.py::ConvTranspose1d:0, line 300 <- wrt source file 2025-12-04T09:17:26.1993346Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/dynamic/modules/conv.py::ConvTranspose1d:0 2025-12-04T09:17:26.1995312Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/dynamic/modules/conv.py::ConvTranspose2d:0, line 383 <- wrt source file 2025-12-04T09:17:26.1997444Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/dynamic/modules/conv.py::ConvTranspose2d:0 2025-12-04T09:17:26.1999420Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/dynamic/modules/conv.py::ConvTranspose3d:0, line 466 <- wrt source file 2025-12-04T09:17:26.2001406Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/dynamic/modules/conv.py::ConvTranspose3d:0 2025-12-04T09:17:26.2003283Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/dynamic/modules/linear.py::Linear:0, line 30 <- wrt source file 2025-12-04T09:17:26.2005137Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/dynamic/modules/linear.py::Linear:0 2025-12-04T09:17:26.2006889Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/dynamic/modules/rnn.py::LSTM:0, line 516 <- wrt source file 2025-12-04T09:17:26.2008688Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/dynamic/modules/rnn.py::LSTM:0 2025-12-04T09:17:26.2010258Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/dynamic/modules/rnn.py::GRU:0, line 803 <- wrt source file 2025-12-04T09:17:26.2011936Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/dynamic/modules/rnn.py::GRU:0 2025-12-04T09:17:26.2013740Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/dynamic/modules/rnn.py::RNNCell:0, line 1209 <- wrt source file 2025-12-04T09:17:26.2015422Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/dynamic/modules/rnn.py::RNNCell:0 2025-12-04T09:17:26.2017133Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/dynamic/modules/rnn.py::LSTMCell:0, line 1276 <- wrt source file 2025-12-04T09:17:26.2019057Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/dynamic/modules/rnn.py::LSTMCell:0 2025-12-04T09:17:26.2020848Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/dynamic/modules/rnn.py::GRUCell:0, line 1329 <- wrt source file 2025-12-04T09:17:26.2022790Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/dynamic/modules/rnn.py::GRUCell:0 2025-12-04T09:17:26.2024557Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/modules/activation.py::ReLU6:0, line 36 <- wrt source file 2025-12-04T09:17:26.2026404Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/modules/activation.py::ReLU6:0 2025-12-04T09:17:26.2028270Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/modules/conv.py::Conv1d:0, line 376 <- wrt source file 2025-12-04T09:17:26.2030031Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/modules/conv.py::Conv1d:0 2025-12-04T09:17:26.2031736Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/modules/conv.py::Conv2d:0, line 506 <- wrt source file 2025-12-04T09:17:26.2033504Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/modules/conv.py::Conv2d:0 2025-12-04T09:17:26.2035169Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/modules/conv.py::Conv3d:0, line 636 <- wrt source file 2025-12-04T09:17:26.2036891Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/modules/conv.py::Conv3d:0 2025-12-04T09:17:26.2038789Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/modules/conv.py::ConvTranspose1d:0, line 893 <- wrt source file 2025-12-04T09:17:26.2040708Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/modules/conv.py::ConvTranspose1d:0 2025-12-04T09:17:26.2042526Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/modules/conv.py::ConvTranspose2d:0, line 1015 <- wrt source file 2025-12-04T09:17:26.2044426Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/modules/conv.py::ConvTranspose2d:0 2025-12-04T09:17:26.2046262Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/modules/conv.py::ConvTranspose3d:0, line 1141 <- wrt source file 2025-12-04T09:17:26.2048118Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/modules/conv.py::ConvTranspose3d:0 2025-12-04T09:17:26.2049911Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/modules/embedding_ops.py::Embedding:0, line 111 <- wrt source file 2025-12-04T09:17:26.2051786Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/modules/embedding_ops.py::Embedding:0 2025-12-04T09:17:26.2053777Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/modules/embedding_ops.py::EmbeddingBag:0, line 275 <- wrt source file 2025-12-04T09:17:26.2055755Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/modules/embedding_ops.py::EmbeddingBag:0 2025-12-04T09:17:26.2057720Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/modules/functional_modules.py::FloatFunctional:0, line 23 <- wrt source file 2025-12-04T09:17:26.2069719Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/modules/functional_modules.py::FloatFunctional:0 2025-12-04T09:17:26.2071808Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/modules/functional_modules.py::QFunctional:0, line 176 <- wrt source file 2025-12-04T09:17:26.2073831Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/modules/functional_modules.py::QFunctional:0 2025-12-04T09:17:26.2075825Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/modules/linear.py::Linear:0, line 135 <- wrt source file 2025-12-04T09:17:26.2077597Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/modules/linear.py::Linear:0 2025-12-04T09:17:26.2079290Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/modules/rnn.py::LSTM:0, line 24 <- wrt source file 2025-12-04T09:17:26.2081087Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/modules/rnn.py::LSTM:0 2025-12-04T09:17:26.2083253Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/pruning/_experimental/data_scheduler/base_data_scheduler.py::BaseDataScheduler.get_schedule_param:0, line 98 <- wrt source file 2025-12-04T09:17:26.2092703Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/pruning/_experimental/data_scheduler/base_data_scheduler.py::BaseDataScheduler.get_schedule_param:0 2025-12-04T09:17:26.2095134Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/pruning/_experimental/data_sparsifier/base_data_sparsifier.py::BaseDataSparsifier:0, line 55 <- wrt source file 2025-12-04T09:17:26.2097496Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/pruning/_experimental/data_sparsifier/base_data_sparsifier.py::BaseDataSparsifier:0 2025-12-04T09:17:26.2099660Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/pruning/scheduler/lambda_scheduler.py::LambdaSL:0, line 24 <- wrt source file 2025-12-04T09:17:26.2101562Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/pruning/scheduler/lambda_scheduler.py::LambdaSL:0 2025-12-04T09:17:26.2103450Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/pruning/sparsifier/base_sparsifier.py::BaseSparsifier:0, line 47 <- wrt source file 2025-12-04T09:17:26.2105442Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/pruning/sparsifier/base_sparsifier.py::BaseSparsifier:0 2025-12-04T09:17:26.2107440Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/pruning/sparsifier/base_sparsifier.py::BaseSparsifier.squash_mask:0, line 251 <- wrt source file 2025-12-04T09:17:26.2109575Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/pruning/sparsifier/base_sparsifier.py::BaseSparsifier.squash_mask:0 2025-12-04T09:17:26.2111259Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/fuse_modules.py::fuse_modules:0, line 175 <- wrt source file 2025-12-04T09:17:26.2112805Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/fuse_modules.py::fuse_modules:0 2025-12-04T09:17:26.2114597Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/fuser_method_mappings.py::fuse_conv_bn:0, line 32 <- wrt source file 2025-12-04T09:17:26.2118909Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/fuser_method_mappings.py::fuse_conv_bn:0 2025-12-04T09:17:26.2120829Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/fuser_method_mappings.py::fuse_conv_bn_relu:0, line 83 <- wrt source file 2025-12-04T09:17:26.2127068Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/fuser_method_mappings.py::fuse_conv_bn_relu:0 2025-12-04T09:17:26.2129037Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/fuser_method_mappings.py::fuse_linear_bn:0, line 143 <- wrt source file 2025-12-04T09:17:26.2134176Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/fuser_method_mappings.py::fuse_linear_bn:0 2025-12-04T09:17:26.2136162Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/fuser_method_mappings.py::fuse_convtranspose_bn:0, line 182 <- wrt source file 2025-12-04T09:17:26.2141803Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/fuser_method_mappings.py::fuse_convtranspose_bn:0 2025-12-04T09:17:26.2143771Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/observer.py::_with_args:0, line 110 <- wrt source file 2025-12-04T09:17:26.2145535Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/observer.py::_with_args:0 2025-12-04T09:17:26.2147249Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/observer.py::_with_callable_args:0, line 132 <- wrt source file 2025-12-04T09:17:26.2149080Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/observer.py::_with_callable_args:0 2025-12-04T09:17:26.2150773Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/quantize_fx.py::fuse_fx:0, line 218 <- wrt source file 2025-12-04T09:17:26.2152504Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/quantize_fx.py::fuse_fx:0 2025-12-04T09:17:26.2154299Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/quantize_fx.py::prepare_fx:0, line 288 <- wrt source file 2025-12-04T09:17:26.2156062Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/quantize_fx.py::prepare_fx:0 2025-12-04T09:17:26.2157791Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/quantize_fx.py::prepare_qat_fx:0, line 427 <- wrt source file 2025-12-04T09:17:26.2159600Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/quantize_fx.py::prepare_qat_fx:0 2025-12-04T09:17:26.2161321Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/quantize_fx.py::convert_fx:0, line 608 <- wrt source file 2025-12-04T09:17:26.2163283Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/quantize_fx.py::convert_fx:0 2025-12-04T09:17:26.2165156Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/quantize_fx.py::convert_to_reference_fx:0, line 668 <- wrt source file 2025-12-04T09:17:26.2167118Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/quantize_fx.py::convert_to_reference_fx:0 2025-12-04T09:17:26.2169088Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/quantize_fx.py::_convert_to_reference_decomposed_fx:0, line 720 <- wrt source file 2025-12-04T09:17:26.2171182Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/quantize_fx.py::_convert_to_reference_decomposed_fx:0 2025-12-04T09:17:26.2173189Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/quantize_pt2e.py::prepare_pt2e:0, line 51 <- wrt source file 2025-12-04T09:17:26.2175037Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/quantize_pt2e.py::prepare_pt2e:0 2025-12-04T09:17:26.2176838Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/quantize_pt2e.py::prepare_qat_pt2e:0, line 130 <- wrt source file 2025-12-04T09:17:26.2178680Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/quantize_pt2e.py::prepare_qat_pt2e:0 2025-12-04T09:17:26.2180620Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/quantize_pt2e.py::convert_pt2e:0, line 228 <- wrt source file 2025-12-04T09:17:26.2182443Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/quantize_pt2e.py::convert_pt2e:0 2025-12-04T09:17:26.2184165Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/utils.py::get_combined_dict:0, line 171 <- wrt source file 2025-12-04T09:17:26.2186222Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/utils.py::get_combined_dict:0 2025-12-04T09:17:26.2187965Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/utils.py::_get_path_of_module:0, line 553 <- wrt source file 2025-12-04T09:17:26.2189740Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/utils.py::_get_path_of_module:0 2025-12-04T09:17:26.2191497Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/utils.py::_get_signature_locals:0, line 575 <- wrt source file 2025-12-04T09:17:26.2193304Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/utils.py::_get_signature_locals:0 2025-12-04T09:17:26.2195067Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/utils.py::_get_default_kwargs:0, line 589 <- wrt source file 2025-12-04T09:17:26.2196981Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/utils.py::_get_default_kwargs:0 2025-12-04T09:17:26.2198747Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/utils.py::_normalize_kwargs:0, line 611 <- wrt source file 2025-12-04T09:17:26.2200524Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/utils.py::_normalize_kwargs:0 2025-12-04T09:17:26.2202213Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/utils.py::_get_num_pos_args:0, line 738 <- wrt source file 2025-12-04T09:17:26.2203927Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/utils.py::_get_num_pos_args:0 2025-12-04T09:17:26.2205815Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/backend_config/backend_config.py::DTypeConfig:0, line 216 <- wrt source file 2025-12-04T09:17:26.2207915Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/backend_config/backend_config.py::DTypeConfig:0 2025-12-04T09:17:26.2209930Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/backend_config/onednn.py::_fuse_linear_bn_leaky_relu:0, line 85 <- wrt source file 2025-12-04T09:17:26.2212037Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/backend_config/onednn.py::_fuse_linear_bn_leaky_relu:0 2025-12-04T09:17:26.2214084Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/fx/_model_report/model_report.py::ModelReport:0, line 85 <- wrt source file 2025-12-04T09:17:26.2216046Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/fx/_model_report/model_report.py::ModelReport:0 2025-12-04T09:17:26.2218305Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/fx/_model_report/model_report_visualizer.py::ModelReportVisualizer.generate_filtered_tables:0, line 341 <- wrt source file 2025-12-04T09:17:26.2221058Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/fx/_model_report/model_report_visualizer.py::ModelReportVisualizer.generate_filtered_tables:0 2025-12-04T09:17:26.2223735Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/fx/_model_report/model_report_visualizer.py::ModelReportVisualizer.generate_table_visualization:0, line 429 <- wrt source file 2025-12-04T09:17:26.2226385Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/fx/_model_report/model_report_visualizer.py::ModelReportVisualizer.generate_table_visualization:0 2025-12-04T09:17:26.2229079Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/fx/_model_report/model_report_visualizer.py::ModelReportVisualizer.generate_plot_visualization:0, line 591 <- wrt source file 2025-12-04T09:17:26.2231808Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/fx/_model_report/model_report_visualizer.py::ModelReportVisualizer.generate_plot_visualization:0 2025-12-04T09:17:26.2234428Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/fx/_model_report/model_report_visualizer.py::ModelReportVisualizer.generate_histogram_visualization:0, line 664 <- wrt source file 2025-12-04T09:17:26.2237128Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/fx/_model_report/model_report_visualizer.py::ModelReportVisualizer.generate_histogram_visualization:0 2025-12-04T09:17:26.2239488Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/pt2e/_affine_quantization.py::_get_reduction_params:0, line 104 <- wrt source file 2025-12-04T09:17:26.2241593Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/pt2e/_affine_quantization.py::_get_reduction_params:0 2025-12-04T09:17:26.2243584Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/pt2e/_affine_quantization.py::_register_custom_op:0, line 155 <- wrt source file 2025-12-04T09:17:26.2245639Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/pt2e/_affine_quantization.py::_register_custom_op:0 2025-12-04T09:17:26.2247572Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/pt2e/prepare.py::_get_edge_or_node_to_group_id:0, line 189 <- wrt source file 2025-12-04T09:17:26.2249567Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/pt2e/prepare.py::_get_edge_or_node_to_group_id:0 2025-12-04T09:17:26.2251564Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/pt2e/utils.py::_replace_literals_with_new_placeholders:0, line 442 <- wrt source file 2025-12-04T09:17:26.2253775Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/pt2e/utils.py::_replace_literals_with_new_placeholders:0 2025-12-04T09:17:26.2255633Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/anomaly_mode.py::detect_anomaly:0, line 28 <- wrt source file 2025-12-04T09:17:26.2257324Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/anomaly_mode.py::detect_anomaly:0 2025-12-04T09:17:26.2258930Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/forward_ad.py::make_dual:0, line 82 <- wrt source file 2025-12-04T09:17:26.2260569Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/forward_ad.py::make_dual:0 2025-12-04T09:17:26.2262368Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/forward_ad.py::unpack_dual:0, line 151 <- wrt source file 2025-12-04T09:17:26.2264045Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/forward_ad.py::unpack_dual:0 2025-12-04T09:17:26.2265738Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/forward_ad.py::dual_level:0, line 187 <- wrt source file 2025-12-04T09:17:26.2267340Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/forward_ad.py::dual_level:0 2025-12-04T09:17:26.2269060Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/function.py::FunctionCtx.save_for_backward:0, line 72 <- wrt source file 2025-12-04T09:17:26.2271200Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/function.py::FunctionCtx.save_for_backward:0 2025-12-04T09:17:26.2273047Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/function.py::FunctionCtx.save_for_forward:0, line 116 <- wrt source file 2025-12-04T09:17:26.2274938Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/function.py::FunctionCtx.save_for_forward:0 2025-12-04T09:17:26.2276705Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/function.py::FunctionCtx.mark_dirty:0, line 169 <- wrt source file 2025-12-04T09:17:26.2278546Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/function.py::FunctionCtx.mark_dirty:0 2025-12-04T09:17:26.2280612Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/function.py::FunctionCtx.mark_non_differentiable:0, line 216 <- wrt source file 2025-12-04T09:17:26.2282670Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/function.py::FunctionCtx.mark_non_differentiable:0 2025-12-04T09:17:26.2284557Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/function.py::FunctionCtx.set_materialize_grads:0, line 245 <- wrt source file 2025-12-04T09:17:26.2286458Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/function.py::FunctionCtx.set_materialize_grads:0 2025-12-04T09:17:26.2288184Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/function.py::Function:0, line 487 <- wrt source file 2025-12-04T09:17:26.2289798Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/function.py::Function:0 2025-12-04T09:17:26.2291378Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/functional.py::vjp:0, line 300 <- wrt source file 2025-12-04T09:17:26.2293101Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/functional.py::vjp:0 2025-12-04T09:17:26.2294656Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/functional.py::jvp:0, line 402 <- wrt source file 2025-12-04T09:17:26.2296234Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/functional.py::jvp:0 2025-12-04T09:17:26.2297800Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/functional.py::jacobian:0, line 642 <- wrt source file 2025-12-04T09:17:26.2299472Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/functional.py::jacobian:0 2025-12-04T09:17:26.2301085Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/functional.py::hessian:0, line 907 <- wrt source file 2025-12-04T09:17:26.2302712Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/functional.py::hessian:0 2025-12-04T09:17:26.2304229Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/functional.py::vhp:0, line 1026 <- wrt source file 2025-12-04T09:17:26.2305935Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/functional.py::vhp:0 2025-12-04T09:17:26.2307458Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/functional.py::hvp:0, line 1125 <- wrt source file 2025-12-04T09:17:26.2308970Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/functional.py::hvp:0 2025-12-04T09:17:26.2310465Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/grad_mode.py::no_grad:0, line 50 <- wrt source file 2025-12-04T09:17:26.2312272Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/grad_mode.py::no_grad:0 2025-12-04T09:17:26.2313859Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/grad_mode.py::enable_grad:0, line 108 <- wrt source file 2025-12-04T09:17:26.2315484Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/grad_mode.py::enable_grad:0 2025-12-04T09:17:26.2317106Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/grad_mode.py::set_grad_enabled:0, line 166 <- wrt source file 2025-12-04T09:17:26.2318819Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/grad_mode.py::set_grad_enabled:0 2025-12-04T09:17:26.2320484Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/grad_mode.py::inference_mode:0, line 252 <- wrt source file 2025-12-04T09:17:26.2322311Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/grad_mode.py::inference_mode:0 2025-12-04T09:17:26.2323897Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/graph.py::Node.name:0, line 60 <- wrt source file 2025-12-04T09:17:26.2325478Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/graph.py::Node.name:0 2025-12-04T09:17:26.2327020Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/graph.py::Node.register_hook:0, line 117 <- wrt source file 2025-12-04T09:17:26.2328662Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/graph.py::Node.register_hook:0 2025-12-04T09:17:26.2330247Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/graph.py::Node.register_prehook:0, line 154 <- wrt source file 2025-12-04T09:17:26.2331929Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/graph.py::Node.register_prehook:0 2025-12-04T09:17:26.2333668Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/graph.py::saved_tensors_hooks:0, line 292 <- wrt source file 2025-12-04T09:17:26.2335413Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/graph.py::saved_tensors_hooks:0 2025-12-04T09:17:26.2337025Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/graph.py::save_on_cpu:0, line 362 <- wrt source file 2025-12-04T09:17:26.2338578Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/graph.py::save_on_cpu:0 2025-12-04T09:17:26.2340201Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/graph.py::disable_saved_tensors_hooks:0, line 419 <- wrt source file 2025-12-04T09:17:26.2342002Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/graph.py::disable_saved_tensors_hooks:0 2025-12-04T09:17:26.2343730Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/graph.py::register_multi_grad_hook:0, line 503 <- wrt source file 2025-12-04T09:17:26.2345490Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/graph.py::register_multi_grad_hook:0 2025-12-04T09:17:26.2347378Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/graph.py::allow_mutation_on_saved_tensors:0, line 777 <- wrt source file 2025-12-04T09:17:26.2349223Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/graph.py::allow_mutation_on_saved_tensors:0 2025-12-04T09:17:26.2350886Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/profiler.py::profile:0, line 182 <- wrt source file 2025-12-04T09:17:26.2352548Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/profiler.py::profile:0 2025-12-04T09:17:26.2354222Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/profiler.py::record_function:0, line 760 <- wrt source file 2025-12-04T09:17:26.2355919Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/profiler.py::record_function:0 2025-12-04T09:17:26.2357515Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/profiler.py::emit_itt:0, line 899 <- wrt source file 2025-12-04T09:17:26.2359104Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/profiler.py::emit_itt:0 2025-12-04T09:17:26.2360657Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/profiler.py::emit_nvtx:0, line 972 <- wrt source file 2025-12-04T09:17:26.2362581Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/profiler.py::emit_nvtx:0 2025-12-04T09:17:26.2364212Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/profiler_util.py::EventList:0, line 60 <- wrt source file 2025-12-04T09:17:26.2365879Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/profiler_util.py::EventList:0 2025-12-04T09:17:26.2367453Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/gds.py::gds_register_buffer:0, line 43 <- wrt source file 2025-12-04T09:17:26.2369030Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/gds.py::gds_register_buffer:0 2025-12-04T09:17:26.2370548Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/gds.py::gds_deregister_buffer:0, line 59 <- wrt source file 2025-12-04T09:17:26.2372148Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/gds.py::gds_deregister_buffer:0 2025-12-04T09:17:26.2373707Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/gds.py::GdsFile:0, line 86 <- wrt source file 2025-12-04T09:17:26.2375160Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/gds.py::GdsFile:0 2025-12-04T09:17:26.2376642Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/jiterator.py::_create_jit_fn:0, line 114 <- wrt source file 2025-12-04T09:17:26.2378235Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/jiterator.py::_create_jit_fn:0 2025-12-04T09:17:26.2379761Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/jiterator.py::_create_jit_fn:1, line 125 <- wrt source file 2025-12-04T09:17:26.2381322Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/jiterator.py::_create_jit_fn:1 2025-12-04T09:17:26.2382848Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/jiterator.py::_create_jit_fn:2, line 140 <- wrt source file 2025-12-04T09:17:26.2384446Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/jiterator.py::_create_jit_fn:2 2025-12-04T09:17:26.2386063Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/jiterator.py::_create_multi_output_jit_fn:0, line 173 <- wrt source file 2025-12-04T09:17:26.2387968Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/jiterator.py::_create_multi_output_jit_fn:0 2025-12-04T09:17:26.2389339Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/profiler.py::profile:0, line 75 <- wrt source file 2025-12-04T09:17:26.2390810Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/profiler.py::profile:0 2025-12-04T09:17:26.2392382Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_mesh_layout.py::_MeshLayout.composition:0, line 125 <- wrt source file 2025-12-04T09:17:26.2394108Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_mesh_layout.py::_MeshLayout.composition:0 2025-12-04T09:17:26.2395671Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_mesh_layout.py::_MeshLayout.complement:0, line 142 <- wrt source file 2025-12-04T09:17:26.2397418Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_mesh_layout.py::_MeshLayout.complement:0 2025-12-04T09:17:26.2399245Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_mesh_layout.py::_MeshLayout.remap_to_tensor:0, line 281 <- wrt source file 2025-12-04T09:17:26.2401298Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_mesh_layout.py::_MeshLayout.remap_to_tensor:0 2025-12-04T09:17:26.2403054Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/device_mesh.py::DeviceMesh:0, line 167 <- wrt source file 2025-12-04T09:17:26.2404764Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/device_mesh.py::DeviceMesh:0 2025-12-04T09:17:26.2406511Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/device_mesh.py::DeviceMesh.get_local_rank:0, line 1027 <- wrt source file 2025-12-04T09:17:26.2408380Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/device_mesh.py::DeviceMesh.get_local_rank:0 2025-12-04T09:17:26.2410144Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/device_mesh.py::init_device_mesh:0, line 1317 <- wrt source file 2025-12-04T09:17:26.2411918Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/device_mesh.py::init_device_mesh:0 2025-12-04T09:17:26.2413771Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::_coalescing_manager:0, line 2652 <- wrt source file 2025-12-04T09:17:26.2415667Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::_coalescing_manager:0 2025-12-04T09:17:26.2417459Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::_time_estimator:0, line 2754 <- wrt source file 2025-12-04T09:17:26.2419309Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::_time_estimator:0 2025-12-04T09:17:26.2421151Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::batch_isend_irecv:0, line 2801 <- wrt source file 2025-12-04T09:17:26.2423046Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::batch_isend_irecv:0 2025-12-04T09:17:26.2424806Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::all_reduce:0, line 2938 <- wrt source file 2025-12-04T09:17:26.2426668Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::all_reduce:0 2025-12-04T09:17:26.2428420Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::all_gather_object:0, line 3221 <- wrt source file 2025-12-04T09:17:26.2430218Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::all_gather_object:0 2025-12-04T09:17:26.2432028Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::gather_object:0, line 3325 <- wrt source file 2025-12-04T09:17:26.2433873Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::gather_object:0 2025-12-04T09:17:26.2435548Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::send_object_list:0, line 3457 <- wrt source file 2025-12-04T09:17:26.2437355Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::send_object_list:0 2025-12-04T09:17:26.2439091Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::recv_object_list:0, line 3574 <- wrt source file 2025-12-04T09:17:26.2440818Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::recv_object_list:0 2025-12-04T09:17:26.2442660Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::broadcast_object_list:0, line 3719 <- wrt source file 2025-12-04T09:17:26.2444496Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::broadcast_object_list:0 2025-12-04T09:17:26.2446254Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::scatter_object_list:0, line 3844 <- wrt source file 2025-12-04T09:17:26.2448087Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::scatter_object_list:0 2025-12-04T09:17:26.2449817Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::all_gather:0, line 3947 <- wrt source file 2025-12-04T09:17:26.2451531Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::all_gather:0 2025-12-04T09:17:26.2453320Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::all_gather_into_tensor:0, line 4054 <- wrt source file 2025-12-04T09:17:26.2455156Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::all_gather_into_tensor:0 2025-12-04T09:17:26.2456965Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::all_gather_coalesced:0, line 4192 <- wrt source file 2025-12-04T09:17:26.2458744Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::all_gather_coalesced:0 2025-12-04T09:17:26.2460396Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::gather:0, line 4298 <- wrt source file 2025-12-04T09:17:26.2462311Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::gather:0 2025-12-04T09:17:26.2463924Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::scatter:0, line 4383 <- wrt source file 2025-12-04T09:17:26.2465628Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::scatter:0 2025-12-04T09:17:26.2467544Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::reduce_scatter_tensor:0, line 4521 <- wrt source file 2025-12-04T09:17:26.2469483Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::reduce_scatter_tensor:0 2025-12-04T09:17:26.2471451Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::all_to_all_single:0, line 4663 <- wrt source file 2025-12-04T09:17:26.2473403Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::all_to_all_single:0 2025-12-04T09:17:26.2475149Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::all_to_all:0, line 4797 <- wrt source file 2025-12-04T09:17:26.2476888Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::all_to_all:0 2025-12-04T09:17:26.2478634Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::monitored_barrier:0, line 5009 <- wrt source file 2025-12-04T09:17:26.2480508Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::monitored_barrier:0 2025-12-04T09:17:26.2482451Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::new_subgroups:0, line 5562 <- wrt source file 2025-12-04T09:17:26.2484293Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::new_subgroups:0 2025-12-04T09:17:26.2486196Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::new_subgroups_by_enumeration:0, line 5656 <- wrt source file 2025-12-04T09:17:26.2488225Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::new_subgroups_by_enumeration:0 2025-12-04T09:17:26.2489964Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/launch.py::__doc__:0, line 84 <- wrt source file 2025-12-04T09:17:26.2491550Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/launch.py::__doc__:0 2025-12-04T09:17:26.2493196Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/run.py::__doc__:0, line 57 <- wrt source file 2025-12-04T09:17:26.2494741Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/run.py::__doc__:0 2025-12-04T09:17:26.2496338Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/autograd/__init__.py::context:0, line 47 <- wrt source file 2025-12-04T09:17:26.2498068Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/autograd/__init__.py::context:0 2025-12-04T09:17:26.2499905Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_composable/checkpoint_activation.py::checkpoint:0, line 53 <- wrt source file 2025-12-04T09:17:26.2501915Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_composable/checkpoint_activation.py::checkpoint:0 2025-12-04T09:17:26.2503766Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_composable/contract.py::contract:0, line 67 <- wrt source file 2025-12-04T09:17:26.2505578Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_composable/contract.py::contract:0 2025-12-04T09:17:26.2507375Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_composable/replicate.py::replicate:0, line 190 <- wrt source file 2025-12-04T09:17:26.2509365Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_composable/replicate.py::replicate:0 2025-12-04T09:17:26.2511215Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_composable/replicate_with_fsdp.py::replicate:0, line 265 <- wrt source file 2025-12-04T09:17:26.2513253Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_composable/replicate_with_fsdp.py::replicate:0 2025-12-04T09:17:26.2515342Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_shard/sharded_optim/__init__.py::named_params_with_sharded_tensor:0, line 31 <- wrt source file 2025-12-04T09:17:26.2517503Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_shard/sharded_optim/__init__.py::named_params_with_sharded_tensor:0 2025-12-04T09:17:26.2519547Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_shard/sharded_tensor/__init__.py::init_from_local_shards:0, line 384 <- wrt source file 2025-12-04T09:17:26.2521606Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_shard/sharded_tensor/__init__.py::init_from_local_shards:0 2025-12-04T09:17:26.2523589Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_shard/sharded_tensor/__init__.py::custom_sharded_op_impl:0, line 457 <- wrt source file 2025-12-04T09:17:26.2525750Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_shard/sharded_tensor/__init__.py::custom_sharded_op_impl:0 2025-12-04T09:17:26.2527849Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_shard/sharded_tensor/api.py::ShardedTensor._init_from_local_tensor:0, line 860 <- wrt source file 2025-12-04T09:17:26.2530071Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_shard/sharded_tensor/api.py::ShardedTensor._init_from_local_tensor:0 2025-12-04T09:17:26.2532141Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_shard/sharded_tensor/api.py::ShardedTensor.reshard:0, line 1098 <- wrt source file 2025-12-04T09:17:26.2534308Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_shard/sharded_tensor/api.py::ShardedTensor.reshard:0 2025-12-04T09:17:26.2536285Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_shard/sharded_tensor/_ops/_common.py::_sharded_op_common:0, line 18 <- wrt source file 2025-12-04T09:17:26.2538327Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_shard/sharded_tensor/_ops/_common.py::_sharded_op_common:0 2025-12-04T09:17:26.2540261Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_shard/sharding_plan/api.py::ShardingPlan:0, line 36 <- wrt source file 2025-12-04T09:17:26.2542150Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_shard/sharding_plan/api.py::ShardingPlan:0 2025-12-04T09:17:26.2543972Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::put:0, line 275 <- wrt source file 2025-12-04T09:17:26.2545821Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::put:0 2025-12-04T09:17:26.2547609Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::get:0, line 328 <- wrt source file 2025-12-04T09:17:26.2549517Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::get:0 2025-12-04T09:17:26.2551457Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::get_nbi:0, line 378 <- wrt source file 2025-12-04T09:17:26.2553375Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::get_nbi:0 2025-12-04T09:17:26.2555402Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::putmem_signal_block:0, line 453 <- wrt source file 2025-12-04T09:17:26.2557503Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::putmem_signal_block:0 2025-12-04T09:17:26.2559446Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::wait_until:0, line 531 <- wrt source file 2025-12-04T09:17:26.2561409Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::wait_until:0 2025-12-04T09:17:26.2563711Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::signal_wait_until:0, line 593 <- wrt source file 2025-12-04T09:17:26.2565373Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::signal_wait_until:0 2025-12-04T09:17:26.2567028Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::signal_op:0, line 651 <- wrt source file 2025-12-04T09:17:26.2568548Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::signal_op:0 2025-12-04T09:17:26.2570594Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::fence:0, line 704 <- wrt source file 2025-12-04T09:17:26.2572510Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::fence:0 2025-12-04T09:17:26.2574144Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::quiet:0, line 750 <- wrt source file 2025-12-04T09:17:26.2575807Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::quiet:0 2025-12-04T09:17:26.2577228Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::my_pe:0, line 794 <- wrt source file 2025-12-04T09:17:26.2579017Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::my_pe:0 2025-12-04T09:17:26.2580518Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::n_pes:0, line 837 <- wrt source file 2025-12-04T09:17:26.2581598Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::n_pes:0 2025-12-04T09:17:26.2582754Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::barrier_all:0, line 888 <- wrt source file 2025-12-04T09:17:26.2583992Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::barrier_all:0 2025-12-04T09:17:26.2585103Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::sync_all:0, line 934 <- wrt source file 2025-12-04T09:17:26.2586699Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::sync_all:0 2025-12-04T09:17:26.2587898Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::alltoall:0, line 973 <- wrt source file 2025-12-04T09:17:26.2589264Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::alltoall:0 2025-12-04T09:17:26.2590487Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::broadcast:0, line 1028 <- wrt source file 2025-12-04T09:17:26.2591770Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::broadcast:0 2025-12-04T09:17:26.2593048Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::reduce:0, line 1089 <- wrt source file 2025-12-04T09:17:26.2594397Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::reduce:0 2025-12-04T09:17:26.2595428Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::reduce_extern_wrapper:0, line 1135 <- wrt source file 2025-12-04T09:17:26.2596582Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::reduce_extern_wrapper:0 2025-12-04T09:17:26.2597722Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_tools/memory_tracker.py::MemoryTracker:0, line 55 <- wrt source file 2025-12-04T09:17:26.2598853Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_tools/memory_tracker.py::MemoryTracker:0 2025-12-04T09:17:26.2599774Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/algorithms/join.py::Join:0, line 141 <- wrt source file 2025-12-04T09:17:26.2600649Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/algorithms/join.py::Join:0 2025-12-04T09:17:26.2601610Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/algorithms/ddp_comm_hooks/__init__.py::register_ddp_comm_hook:0, line 137 <- wrt source file 2025-12-04T09:17:26.2602882Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/algorithms/ddp_comm_hooks/__init__.py::register_ddp_comm_hook:0 2025-12-04T09:17:26.2603943Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/algorithms/ddp_comm_hooks/debugging_hooks.py::noop_hook:0, line 23 <- wrt source file 2025-12-04T09:17:26.2605004Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/algorithms/ddp_comm_hooks/debugging_hooks.py::noop_hook:0 2025-12-04T09:17:26.2606167Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/algorithms/ddp_comm_hooks/default_hooks.py::allreduce_hook:0, line 51 <- wrt source file 2025-12-04T09:17:26.2607273Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/algorithms/ddp_comm_hooks/default_hooks.py::allreduce_hook:0 2025-12-04T09:17:26.2608345Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/algorithms/ddp_comm_hooks/default_hooks.py::fp16_compress_hook:0, line 110 <- wrt source file 2025-12-04T09:17:26.2609446Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/algorithms/ddp_comm_hooks/default_hooks.py::fp16_compress_hook:0 2025-12-04T09:17:26.2610608Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/algorithms/ddp_comm_hooks/default_hooks.py::bf16_compress_hook:0, line 131 <- wrt source file 2025-12-04T09:17:26.2611705Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/algorithms/ddp_comm_hooks/default_hooks.py::bf16_compress_hook:0 2025-12-04T09:17:26.2612993Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/algorithms/ddp_comm_hooks/default_hooks.py::fp16_compress_wrapper:0, line 149 <- wrt source file 2025-12-04T09:17:26.2614163Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/algorithms/ddp_comm_hooks/default_hooks.py::fp16_compress_wrapper:0 2025-12-04T09:17:26.2615251Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/algorithms/ddp_comm_hooks/default_hooks.py::bf16_compress_wrapper:0, line 188 <- wrt source file 2025-12-04T09:17:26.2616376Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/algorithms/ddp_comm_hooks/default_hooks.py::bf16_compress_wrapper:0 2025-12-04T09:17:26.2617471Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/algorithms/ddp_comm_hooks/post_localSGD_hook.py::post_localSGD_hook:0, line 91 <- wrt source file 2025-12-04T09:17:26.2618642Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/algorithms/ddp_comm_hooks/post_localSGD_hook.py::post_localSGD_hook:0 2025-12-04T09:17:26.2619749Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/algorithms/ddp_comm_hooks/powerSGD_hook.py::powerSGD_hook:0, line 395 <- wrt source file 2025-12-04T09:17:26.2620820Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/algorithms/ddp_comm_hooks/powerSGD_hook.py::powerSGD_hook:0 2025-12-04T09:17:26.2621897Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/algorithms/ddp_comm_hooks/powerSGD_hook.py::batched_powerSGD_hook:0, line 708 <- wrt source file 2025-12-04T09:17:26.2623385Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/algorithms/ddp_comm_hooks/powerSGD_hook.py::batched_powerSGD_hook:0 2025-12-04T09:17:26.2624562Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/algorithms/ddp_comm_hooks/quantization_hooks.py::quantization_pertensor_hook:0, line 64 <- wrt source file 2025-12-04T09:17:26.2625770Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/algorithms/ddp_comm_hooks/quantization_hooks.py::quantization_pertensor_hook:0 2025-12-04T09:17:26.2627386Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/algorithms/ddp_comm_hooks/quantization_hooks.py::quantization_perchannel_hook:0, line 146 <- wrt source file 2025-12-04T09:17:26.2628651Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/algorithms/ddp_comm_hooks/quantization_hooks.py::quantization_perchannel_hook:0 2025-12-04T09:17:26.2629807Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/algorithms/model_averaging/averagers.py::PeriodicModelAverager:0, line 56 <- wrt source file 2025-12-04T09:17:26.2631076Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/algorithms/model_averaging/averagers.py::PeriodicModelAverager:0 2025-12-04T09:17:26.2632783Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py::HierarchicalModelAverager:0, line 53 <- wrt source file 2025-12-04T09:17:26.2634546Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py::HierarchicalModelAverager:0 2025-12-04T09:17:26.2636102Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/format_utils.py::BroadcastingTorchSaveReader:0, line 49 <- wrt source file 2025-12-04T09:17:26.2637449Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/format_utils.py::BroadcastingTorchSaveReader:0 2025-12-04T09:17:26.2638914Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/format_utils.py::DynamicMetaLoadPlanner:0, line 173 <- wrt source file 2025-12-04T09:17:26.2640235Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/format_utils.py::DynamicMetaLoadPlanner:0 2025-12-04T09:17:26.2641530Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/optimizer.py::load_sharded_optimizer_state_dict:0, line 228 <- wrt source file 2025-12-04T09:17:26.2642902Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/optimizer.py::load_sharded_optimizer_state_dict:0 2025-12-04T09:17:26.2644197Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/state_dict.py::get_state_dict:0, line 1276 <- wrt source file 2025-12-04T09:17:26.2645446Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/state_dict.py::get_state_dict:0 2025-12-04T09:17:26.2646864Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/state_dict.py::_patch_model_state_dict:0, line 1531 <- wrt source file 2025-12-04T09:17:26.2648296Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/state_dict.py::_patch_model_state_dict:0 2025-12-04T09:17:26.2649344Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/state_dict.py::_patch_optimizer_state_dict:0, line 1590 <- wrt source file 2025-12-04T09:17:26.2650396Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/state_dict.py::_patch_optimizer_state_dict:0 2025-12-04T09:17:26.2651375Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/state_dict_loader.py::load:0, line 131 <- wrt source file 2025-12-04T09:17:26.2652731Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/state_dict_loader.py::load:0 2025-12-04T09:17:26.2653665Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/state_dict_saver.py::save:0, line 160 <- wrt source file 2025-12-04T09:17:26.2654619Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/state_dict_saver.py::save:0 2025-12-04T09:17:26.2655548Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/state_dict_saver.py::async_save:0, line 275 <- wrt source file 2025-12-04T09:17:26.2656517Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/state_dict_saver.py::async_save:0 2025-12-04T09:17:26.2657958Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/_experimental/barriers.py::BarrierConfig:0, line 50 <- wrt source file 2025-12-04T09:17:26.2659045Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/_experimental/barriers.py::BarrierConfig:0 2025-12-04T09:17:26.2660090Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/_experimental/builder.py::make_sync_checkpointer:0, line 78 <- wrt source file 2025-12-04T09:17:26.2661249Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/_experimental/builder.py::make_sync_checkpointer:0 2025-12-04T09:17:26.2662536Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/_experimental/builder.py::make_async_checkpointer:0, line 139 <- wrt source file 2025-12-04T09:17:26.2664183Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/_experimental/builder.py::make_async_checkpointer:0 2025-12-04T09:17:26.2665376Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/_experimental/checkpointer.py::SyncCheckpointer:0, line 104 <- wrt source file 2025-12-04T09:17:26.2666509Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/_experimental/checkpointer.py::SyncCheckpointer:0 2025-12-04T09:17:26.2667608Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/_experimental/checkpointer.py::SyncCheckpointer.save:0, line 142 <- wrt source file 2025-12-04T09:17:26.2669140Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/_experimental/checkpointer.py::SyncCheckpointer.save:0 2025-12-04T09:17:26.2670370Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/_experimental/checkpointer.py::AsyncCheckpointer:0, line 213 <- wrt source file 2025-12-04T09:17:26.2671501Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/_experimental/checkpointer.py::AsyncCheckpointer:0 2025-12-04T09:17:26.2672599Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/_experimental/checkpointer.py::AsyncCheckpointer.save:0, line 260 <- wrt source file 2025-12-04T09:17:26.2673741Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/_experimental/checkpointer.py::AsyncCheckpointer.save:0 2025-12-04T09:17:26.2674819Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/_experimental/staging.py::DefaultStager.close:0, line 211 <- wrt source file 2025-12-04T09:17:26.2675897Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/_experimental/staging.py::DefaultStager.close:0 2025-12-04T09:17:26.2676946Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/elastic/events/__init__.py::construct_and_record_rdzv_event:0, line 110 <- wrt source file 2025-12-04T09:17:26.2678016Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/elastic/events/__init__.py::construct_and_record_rdzv_event:0 2025-12-04T09:17:26.2679078Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/api.py::RendezvousHandler.shutdown:0, line 232 <- wrt source file 2025-12-04T09:17:26.2680146Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/api.py::RendezvousHandler.shutdown:0 2025-12-04T09:17:26.2681155Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/elastic/utils/distributed.py::get_free_port:0, line 140 <- wrt source file 2025-12-04T09:17:26.2682147Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/elastic/utils/distributed.py::get_free_port:0 2025-12-04T09:17:26.2683054Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/api.py::MixedPrecision:0, line 202 <- wrt source file 2025-12-04T09:17:26.2684352Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/api.py::MixedPrecision:0 2025-12-04T09:17:26.2685396Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/api.py::StateDictType:0, line 262 <- wrt source file 2025-12-04T09:17:26.2686270Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/api.py::StateDictType:0 2025-12-04T09:17:26.2687315Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py::FullyShardedDataParallel:0, line 125 <- wrt source file 2025-12-04T09:17:26.2688724Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py::FullyShardedDataParallel:0 2025-12-04T09:17:26.2690057Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py::FullyShardedDataParallel.set_state_dict_type:0, line 651 <- wrt source file 2025-12-04T09:17:26.2691303Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py::FullyShardedDataParallel.set_state_dict_type:0 2025-12-04T09:17:26.2692633Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py::FullyShardedDataParallel.state_dict_type:0, line 805 <- wrt source file 2025-12-04T09:17:26.2694070Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py::FullyShardedDataParallel.state_dict_type:0 2025-12-04T09:17:26.2695981Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py::FullyShardedDataParallel.shard_full_optim_state_dict:0, line 1513 <- wrt source file 2025-12-04T09:17:26.2697349Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py::FullyShardedDataParallel.shard_full_optim_state_dict:0 2025-12-04T09:17:26.2698645Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py::FullyShardedDataParallel.scatter_full_optim_state_dict:0, line 1633 <- wrt source file 2025-12-04T09:17:26.2699964Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py::FullyShardedDataParallel.scatter_full_optim_state_dict:0 2025-12-04T09:17:26.2701232Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py::FullyShardedDataParallel.rekey_optim_state_dict:0, line 1718 <- wrt source file 2025-12-04T09:17:26.2702643Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py::FullyShardedDataParallel.rekey_optim_state_dict:0 2025-12-04T09:17:26.2703858Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py::FullyShardedDataParallel.optim_state_dict:0, line 1850 <- wrt source file 2025-12-04T09:17:26.2705082Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py::FullyShardedDataParallel.optim_state_dict:0 2025-12-04T09:17:26.2706450Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py::FullyShardedDataParallel.optim_state_dict_to_load:0, line 1937 <- wrt source file 2025-12-04T09:17:26.2707736Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py::FullyShardedDataParallel.optim_state_dict_to_load:0 2025-12-04T09:17:26.2708925Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/sharded_grad_scaler.py::ShardedGradScaler:0, line 57 <- wrt source file 2025-12-04T09:17:26.2710094Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/sharded_grad_scaler.py::ShardedGradScaler:0 2025-12-04T09:17:26.2711228Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py::CustomPolicy:0, line 227 <- wrt source file 2025-12-04T09:17:26.2712151Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py::CustomPolicy:0 2025-12-04T09:17:26.2713019Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/nn/functional.py::_all_gather_base:0, line 134 <- wrt source file 2025-12-04T09:17:26.2713932Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/nn/functional.py::_all_gather_base:0 2025-12-04T09:17:26.2714865Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/nn/api/remote_module.py::_RemoteModule.__init__:0, line 196 <- wrt source file 2025-12-04T09:17:26.2716137Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/nn/api/remote_module.py::_RemoteModule.__init__:0 2025-12-04T09:17:26.2717228Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/nn/api/remote_module.py::_RemoteModule.init_from_module_rref:0, line 527 <- wrt source file 2025-12-04T09:17:26.2718319Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/nn/api/remote_module.py::_RemoteModule.init_from_module_rref:0 2025-12-04T09:17:26.2719535Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/nn/api/remote_module.py::RemoteModule:0, line 658 <- wrt source file 2025-12-04T09:17:26.2720520Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/nn/api/remote_module.py::RemoteModule:0 2025-12-04T09:17:26.2721528Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/optim/apply_optimizer_in_backward.py::_apply_optimizer_in_backward:0, line 43 <- wrt source file 2025-12-04T09:17:26.2722656Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/optim/apply_optimizer_in_backward.py::_apply_optimizer_in_backward:0 2025-12-04T09:17:26.2723769Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/optim/apply_optimizer_in_backward.py::_get_in_backward_optimizers:0, line 114 <- wrt source file 2025-12-04T09:17:26.2725131Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/optim/apply_optimizer_in_backward.py::_get_in_backward_optimizers:0 2025-12-04T09:17:26.2726175Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/optim/named_optimizer.py::_NamedOptimizer:0, line 43 <- wrt source file 2025-12-04T09:17:26.2727149Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/optim/named_optimizer.py::_NamedOptimizer:0 2025-12-04T09:17:26.2728122Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/optim/optimizer.py::DistributedOptimizer:0, line 161 <- wrt source file 2025-12-04T09:17:26.2729363Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/optim/optimizer.py::DistributedOptimizer:0 2025-12-04T09:17:26.2730404Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/optim/post_localSGD_optimizer.py::PostLocalSGDOptimizer:0, line 19 <- wrt source file 2025-12-04T09:17:26.2731564Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/optim/post_localSGD_optimizer.py::PostLocalSGDOptimizer:0 2025-12-04T09:17:26.2732684Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/optim/utils.py::register_functional_optim:0, line 37 <- wrt source file 2025-12-04T09:17:26.2733651Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/optim/utils.py::register_functional_optim:0 2025-12-04T09:17:26.2734753Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/optim/zero_redundancy_optimizer.py::ZeroRedundancyOptimizer:0, line 341 <- wrt source file 2025-12-04T09:17:26.2736154Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/optim/zero_redundancy_optimizer.py::ZeroRedundancyOptimizer:0 2025-12-04T09:17:26.2737126Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/pipelining/_IR.py::pipe_split:0, line 345 <- wrt source file 2025-12-04T09:17:26.2738023Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/pipelining/_IR.py::pipe_split:0 2025-12-04T09:17:26.2738945Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/pipelining/microbatch.py::_CustomReducer:0, line 36 <- wrt source file 2025-12-04T09:17:26.2740241Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/pipelining/microbatch.py::_CustomReducer:0 2025-12-04T09:17:26.2741290Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/pipelining/microbatch.py::TensorChunkSpec.from_tuple:0, line 85 <- wrt source file 2025-12-04T09:17:26.2742361Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/pipelining/microbatch.py::TensorChunkSpec.from_tuple:0 2025-12-04T09:17:26.2743402Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/pipelining/microbatch.py::TensorChunkSpec.from_dict:0, line 104 <- wrt source file 2025-12-04T09:17:26.2744689Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/pipelining/microbatch.py::TensorChunkSpec.from_dict:0 2025-12-04T09:17:26.2745634Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/rpc/api.py::_wait_all:0, line 174 <- wrt source file 2025-12-04T09:17:26.2746462Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/rpc/api.py::_wait_all:0 2025-12-04T09:17:26.2747260Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/rpc/api.py::shutdown:0, line 343 <- wrt source file 2025-12-04T09:17:26.2748080Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/rpc/api.py::shutdown:0 2025-12-04T09:17:26.2748860Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/rpc/api.py::remote:0, line 605 <- wrt source file 2025-12-04T09:17:26.2749657Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/rpc/api.py::remote:0 2025-12-04T09:17:26.2750442Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/rpc/api.py::rpc_sync:0, line 786 <- wrt source file 2025-12-04T09:17:26.2751252Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/rpc/api.py::rpc_sync:0 2025-12-04T09:17:26.2752068Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/rpc/api.py::rpc_async:0, line 878 <- wrt source file 2025-12-04T09:17:26.2753148Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/rpc/api.py::rpc_async:0 2025-12-04T09:17:26.2754127Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/rpc/functions.py::async_execution:0, line 34 <- wrt source file 2025-12-04T09:17:26.2755044Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/rpc/functions.py::async_execution:0 2025-12-04T09:17:26.2756698Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/rpc/options.py::TensorPipeRpcBackendOptions.set_device_map:0, line 126 <- wrt source file 2025-12-04T09:17:26.2758142Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/rpc/options.py::TensorPipeRpcBackendOptions.set_device_map:0 2025-12-04T09:17:26.2759291Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/rpc/server_process_global_profiler.py::_server_process_global_profile:0, line 62 <- wrt source file 2025-12-04T09:17:26.2760450Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/rpc/server_process_global_profiler.py::_server_process_global_profile:0 2025-12-04T09:17:26.2761444Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/_api.py::_shard_tensor:0, line 887 <- wrt source file 2025-12-04T09:17:26.2762507Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/_api.py::_shard_tensor:0 2025-12-04T09:17:26.2763518Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/_dtensor_spec.py::ShardOrderEntry:0, line 32 <- wrt source file 2025-12-04T09:17:26.2764842Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/_dtensor_spec.py::ShardOrderEntry:0 2025-12-04T09:17:26.2765906Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/_dtensor_spec.py::DTensorSpec._convert_shard_order_to_StridedShard:0, line 165 <- wrt source file 2025-12-04T09:17:26.2767073Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/_dtensor_spec.py::DTensorSpec._convert_shard_order_to_StridedShard:0 2025-12-04T09:17:26.2768433Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/_dtensor_spec.py::DTensorSpec._maybe_convert_StridedShard_to_shard_order:0, line 241 <- wrt source file 2025-12-04T09:17:26.2769661Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/_dtensor_spec.py::DTensorSpec._maybe_convert_StridedShard_to_shard_order:0 2025-12-04T09:17:26.2770777Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/_dtensor_spec.py::DTensorSpec.format_shard_order_str:0, line 461 <- wrt source file 2025-12-04T09:17:26.2771859Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/_dtensor_spec.py::DTensorSpec.format_shard_order_str:0 2025-12-04T09:17:26.2773041Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/_random.py::OffsetBasedRNGTracker._set_pre_op_offset:0, line 310 <- wrt source file 2025-12-04T09:17:26.2774149Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/_random.py::OffsetBasedRNGTracker._set_pre_op_offset:0 2025-12-04T09:17:26.2775164Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/_ops/_common_rules.py::pointwise_rule:0, line 234 <- wrt source file 2025-12-04T09:17:26.2776397Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/_ops/_common_rules.py::pointwise_rule:0 2025-12-04T09:17:26.2777756Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/experimental/_func_map.py::local_map:0, line 103 <- wrt source file 2025-12-04T09:17:26.2778779Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/experimental/_func_map.py::local_map:0 2025-12-04T09:17:26.2779819Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/experimental/_register_sharding.py::register_sharding:0, line 46 <- wrt source file 2025-12-04T09:17:26.2781266Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/experimental/_register_sharding.py::register_sharding:0 2025-12-04T09:17:26.2782728Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/experimental/_context_parallel/_load_balancer.py::_LoadBalancer._generate_indices:0, line 30 <- wrt source file 2025-12-04T09:17:26.2784039Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/experimental/_context_parallel/_load_balancer.py::_LoadBalancer._generate_indices:0 2025-12-04T09:17:26.2785332Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/experimental/_context_parallel/_load_balancer.py::_HeadTailLoadBalancer._generate_indices:0, line 102 <- wrt source file 2025-12-04T09:17:26.2786807Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/experimental/_context_parallel/_load_balancer.py::_HeadTailLoadBalancer._generate_indices:0 2025-12-04T09:17:26.2788673Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/experimental/_context_parallel/_load_balancer.py::_PerDocumentHeadTailLoadBalancer._generate_indices:0, line 213 <- wrt source file 2025-12-04T09:17:26.2790374Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/experimental/_context_parallel/_load_balancer.py::_PerDocumentHeadTailLoadBalancer._generate_indices:0 2025-12-04T09:17:26.2791724Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/experimental/_context_parallel/_load_balancer.py::_PTRRLoadBalancer.ptrr_scheduling:0, line 339 <- wrt source file 2025-12-04T09:17:26.2793264Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/experimental/_context_parallel/_load_balancer.py::_PTRRLoadBalancer.ptrr_scheduling:0 2025-12-04T09:17:26.2795039Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/experimental/_context_parallel/_load_balancer.py::_PTRRLoadBalancer._generate_indices:0, line 397 <- wrt source file 2025-12-04T09:17:26.2796741Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/experimental/_context_parallel/_load_balancer.py::_PTRRLoadBalancer._generate_indices:0 2025-12-04T09:17:26.2797884Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/parallel/api.py::parallelize_module:0, line 55 <- wrt source file 2025-12-04T09:17:26.2798879Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/parallel/api.py::parallelize_module:0 2025-12-04T09:17:26.2800066Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/parallel/ddp.py::_pre_dp_module_transform:0, line 88 <- wrt source file 2025-12-04T09:17:26.2801369Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/parallel/ddp.py::_pre_dp_module_transform:0 2025-12-04T09:17:26.2802344Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/parallel/loss.py::loss_parallel:0, line 56 <- wrt source file 2025-12-04T09:17:26.2803392Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/parallel/loss.py::loss_parallel:0 2025-12-04T09:17:26.2804345Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/parallel/style.py::ColwiseParallel:0, line 64 <- wrt source file 2025-12-04T09:17:26.2805569Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/parallel/style.py::ColwiseParallel:0 2025-12-04T09:17:26.2806839Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/parallel/style.py::RowwiseParallel:0, line 198 <- wrt source file 2025-12-04T09:17:26.2807936Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/parallel/style.py::RowwiseParallel:0 2025-12-04T09:17:26.2808910Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/parallel/style.py::SequenceParallel:0, line 350 <- wrt source file 2025-12-04T09:17:26.2809902Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/parallel/style.py::SequenceParallel:0 2025-12-04T09:17:26.2811320Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/parallel/style.py::PrepareModuleInput:0, line 452 <- wrt source file 2025-12-04T09:17:26.2812779Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/parallel/style.py::PrepareModuleInput:0 2025-12-04T09:17:26.2814031Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/parallel/style.py::PrepareModuleOutput:0, line 614 <- wrt source file 2025-12-04T09:17:26.2815084Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/parallel/style.py::PrepareModuleOutput:0 2025-12-04T09:17:26.2816118Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/parallel/style.py::PrepareModuleInputOutput:0, line 740 <- wrt source file 2025-12-04T09:17:26.2817174Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/parallel/style.py::PrepareModuleInputOutput:0 2025-12-04T09:17:26.2818428Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/bernoulli.py::Bernoulli:0, line 30 <- wrt source file 2025-12-04T09:17:26.2819449Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/bernoulli.py::Bernoulli:0 2025-12-04T09:17:26.2820256Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/beta.py::Beta:0, line 21 <- wrt source file 2025-12-04T09:17:26.2821037Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/beta.py::Beta:0 2025-12-04T09:17:26.2821832Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/binomial.py::Binomial:0, line 31 <- wrt source file 2025-12-04T09:17:26.2822684Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/binomial.py::Binomial:0 2025-12-04T09:17:26.2823847Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/categorical.py::Categorical:0, line 42 <- wrt source file 2025-12-04T09:17:26.2824932Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/categorical.py::Categorical:0 2025-12-04T09:17:26.2825776Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/cauchy.py::Cauchy:0, line 23 <- wrt source file 2025-12-04T09:17:26.2826588Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/cauchy.py::Cauchy:0 2025-12-04T09:17:26.2827435Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/chi2.py::Chi2:0, line 18 <- wrt source file 2025-12-04T09:17:26.2828210Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/chi2.py::Chi2:0 2025-12-04T09:17:26.2829253Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/constraints.py::is_dependent:0, line 167 <- wrt source file 2025-12-04T09:17:26.2830955Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/constraints.py::is_dependent:0 2025-12-04T09:17:26.2832352Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/constraints.py::_DependentProperty:0, line 188 <- wrt source file 2025-12-04T09:17:26.2833437Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/constraints.py::_DependentProperty:0 2025-12-04T09:17:26.2834448Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/continuous_bernoulli.py::ContinuousBernoulli:0, line 35 <- wrt source file 2025-12-04T09:17:26.2835474Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/continuous_bernoulli.py::ContinuousBernoulli:0 2025-12-04T09:17:26.2836394Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/dirichlet.py::Dirichlet:0, line 44 <- wrt source file 2025-12-04T09:17:26.2837486Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/dirichlet.py::Dirichlet:0 2025-12-04T09:17:26.2838363Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/exponential.py::Exponential:0, line 20 <- wrt source file 2025-12-04T09:17:26.2839249Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/exponential.py::Exponential:0 2025-12-04T09:17:26.2840189Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/fishersnedecor.py::FisherSnedecor:0, line 21 <- wrt source file 2025-12-04T09:17:26.2841156Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/fishersnedecor.py::FisherSnedecor:0 2025-12-04T09:17:26.2842012Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/gamma.py::Gamma:0, line 24 <- wrt source file 2025-12-04T09:17:26.2842820Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/gamma.py::Gamma:0 2025-12-04T09:17:26.2843700Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/generalized_pareto.py::GeneralizedPareto:0, line 26 <- wrt source file 2025-12-04T09:17:26.2844693Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/generalized_pareto.py::GeneralizedPareto:0 2025-12-04T09:17:26.2845597Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/geometric.py::Geometric:0, line 36 <- wrt source file 2025-12-04T09:17:26.2846448Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/geometric.py::Geometric:0 2025-12-04T09:17:26.2847425Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/gumbel.py::Gumbel:0, line 23 <- wrt source file 2025-12-04T09:17:26.2848265Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/gumbel.py::Gumbel:0 2025-12-04T09:17:26.2849099Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/half_cauchy.py::HalfCauchy:0, line 24 <- wrt source file 2025-12-04T09:17:26.2849974Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/half_cauchy.py::HalfCauchy:0 2025-12-04T09:17:26.2850888Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/half_normal.py::HalfNormal:0, line 24 <- wrt source file 2025-12-04T09:17:26.2851756Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/half_normal.py::HalfNormal:0 2025-12-04T09:17:26.2852743Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/independent.py::Independent:0, line 27 <- wrt source file 2025-12-04T09:17:26.2853703Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/independent.py::Independent:0 2025-12-04T09:17:26.2854778Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/inverse_gamma.py::InverseGamma:0, line 24 <- wrt source file 2025-12-04T09:17:26.2855713Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/inverse_gamma.py::InverseGamma:0 2025-12-04T09:17:26.2856600Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/kumaraswamy.py::Kumaraswamy:0, line 30 <- wrt source file 2025-12-04T09:17:26.2857487Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/kumaraswamy.py::Kumaraswamy:0 2025-12-04T09:17:26.2858319Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/laplace.py::Laplace:0, line 20 <- wrt source file 2025-12-04T09:17:26.2859203Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/laplace.py::Laplace:0 2025-12-04T09:17:26.2860216Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/lkj_cholesky.py::LKJCholesky:0, line 43 <- wrt source file 2025-12-04T09:17:26.2861368Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/lkj_cholesky.py::LKJCholesky:0 2025-12-04T09:17:26.2862650Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/log_normal.py::LogNormal:0, line 23 <- wrt source file 2025-12-04T09:17:26.2863884Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/log_normal.py::LogNormal:0 2025-12-04T09:17:26.2864913Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/logistic_normal.py::LogisticNormal:0, line 28 <- wrt source file 2025-12-04T09:17:26.2866081Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/logistic_normal.py::LogisticNormal:0 2025-12-04T09:17:26.2867241Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/lowrank_multivariate_normal.py::LowRankMultivariateNormal:0, line 63 <- wrt source file 2025-12-04T09:17:26.2868642Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/lowrank_multivariate_normal.py::LowRankMultivariateNormal:0 2025-12-04T09:17:26.2869939Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/mixture_same_family.py::MixtureSameFamily:0, line 24 <- wrt source file 2025-12-04T09:17:26.2871238Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/mixture_same_family.py::MixtureSameFamily:0 2025-12-04T09:17:26.2872622Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/multinomial.py::Multinomial:0, line 38 <- wrt source file 2025-12-04T09:17:26.2874112Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/multinomial.py::Multinomial:0 2025-12-04T09:17:26.2875117Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/multivariate_normal.py::MultivariateNormal:0, line 103 <- wrt source file 2025-12-04T09:17:26.2876252Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/multivariate_normal.py::MultivariateNormal:0 2025-12-04T09:17:26.2877267Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/normal.py::Normal:0, line 22 <- wrt source file 2025-12-04T09:17:26.2878108Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/normal.py::Normal:0 2025-12-04T09:17:26.2879085Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/one_hot_categorical.py::OneHotCategorical:0, line 34 <- wrt source file 2025-12-04T09:17:26.2880163Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/one_hot_categorical.py::OneHotCategorical:0 2025-12-04T09:17:26.2881619Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/pareto.py::Pareto:0, line 20 <- wrt source file 2025-12-04T09:17:26.2882487Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/pareto.py::Pareto:0 2025-12-04T09:17:26.2883541Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/poisson.py::Poisson:0, line 25 <- wrt source file 2025-12-04T09:17:26.2884616Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/poisson.py::Poisson:0 2025-12-04T09:17:26.2885711Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/relaxed_bernoulli.py::RelaxedBernoulli:0, line 130 <- wrt source file 2025-12-04T09:17:26.2886869Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/relaxed_bernoulli.py::RelaxedBernoulli:0 2025-12-04T09:17:26.2888115Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/relaxed_categorical.py::RelaxedOneHotCategorical:0, line 117 <- wrt source file 2025-12-04T09:17:26.2889416Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/relaxed_categorical.py::RelaxedOneHotCategorical:0 2025-12-04T09:17:26.2890358Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/studentT.py::StudentT:0, line 22 <- wrt source file 2025-12-04T09:17:26.2891224Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/studentT.py::StudentT:0 2025-12-04T09:17:26.2892092Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/transforms.py::CatTransform:0, line 1076 <- wrt source file 2025-12-04T09:17:26.2893106Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/transforms.py::CatTransform:0 2025-12-04T09:17:26.2893987Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/transforms.py::StackTransform:0, line 1190 <- wrt source file 2025-12-04T09:17:26.2894896Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/transforms.py::StackTransform:0 2025-12-04T09:17:26.2895861Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/transforms.py::CumulativeDistributionTransform:0, line 1268 <- wrt source file 2025-12-04T09:17:26.2896912Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/transforms.py::CumulativeDistributionTransform:0 2025-12-04T09:17:26.2897815Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/uniform.py::Uniform:0, line 21 <- wrt source file 2025-12-04T09:17:26.2898658Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/uniform.py::Uniform:0 2025-12-04T09:17:26.2899550Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/utils.py::clamp_probs:0, line 114 <- wrt source file 2025-12-04T09:17:26.2900387Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/utils.py::clamp_probs:0 2025-12-04T09:17:26.2901195Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/von_mises.py::VonMises:0, line 119 <- wrt source file 2025-12-04T09:17:26.2902106Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/von_mises.py::VonMises:0 2025-12-04T09:17:26.2902969Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/weibull.py::Weibull:0, line 22 <- wrt source file 2025-12-04T09:17:26.2903788Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/weibull.py::Weibull:0 2025-12-04T09:17:26.2904587Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/wishart.py::Wishart:0, line 39 <- wrt source file 2025-12-04T09:17:26.2905394Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/wishart.py::Wishart:0 2025-12-04T09:17:26.2906219Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/export/_unlift.py::_convert_guards_code_to_fn:0, line 158 <- wrt source file 2025-12-04T09:17:26.2907140Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/export/_unlift.py::_convert_guards_code_to_fn:0 2025-12-04T09:17:26.2907966Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/export/dynamic_shapes.py::Dim:0, line 123 <- wrt source file 2025-12-04T09:17:26.2908763Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/export/dynamic_shapes.py::Dim:0 2025-12-04T09:17:26.2909602Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/export/dynamic_shapes.py::ShapesCollection:0, line 737 <- wrt source file 2025-12-04T09:17:26.2910498Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/export/dynamic_shapes.py::ShapesCollection:0 2025-12-04T09:17:26.2911369Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/export/dynamic_shapes.py::ShapesCollection:1, line 753 <- wrt source file 2025-12-04T09:17:26.2912256Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/export/dynamic_shapes.py::ShapesCollection:1 2025-12-04T09:17:26.2913114Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/export/dynamic_shapes.py::AdditionalInputs:0, line 837 <- wrt source file 2025-12-04T09:17:26.2913999Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/export/dynamic_shapes.py::AdditionalInputs:0 2025-12-04T09:17:26.2914792Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/graph.py::_snake_case:0, line 104 <- wrt source file 2025-12-04T09:17:26.2915542Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/graph.py::_snake_case:0 2025-12-04T09:17:26.2916600Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/graph.py::Graph.eliminate_dead_code:0, line 2043 <- wrt source file 2025-12-04T09:17:26.2917497Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/graph.py::Graph.eliminate_dead_code:0 2025-12-04T09:17:26.2918321Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/graph.py::Graph.on_generate_code:0, line 2137 <- wrt source file 2025-12-04T09:17:26.2919171Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/graph.py::Graph.on_generate_code:0 2025-12-04T09:17:26.2920158Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/interpreter.py::Interpreter:0, line 75 <- wrt source file 2025-12-04T09:17:26.2921100Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/interpreter.py::Interpreter:0 2025-12-04T09:17:26.2921892Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/interpreter.py::Transformer:0, line 519 <- wrt source file 2025-12-04T09:17:26.2922764Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/interpreter.py::Transformer:0 2025-12-04T09:17:26.2923699Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/subgraph_rewriter.py::replace_pattern:0, line 126 <- wrt source file 2025-12-04T09:17:26.2924766Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/subgraph_rewriter.py::replace_pattern:0 2025-12-04T09:17:26.2925581Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/tensor_type.py::TensorType:0, line 12 <- wrt source file 2025-12-04T09:17:26.2926370Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/tensor_type.py::TensorType:0 2025-12-04T09:17:26.2927147Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/tensor_type.py::is_consistent:0, line 65 <- wrt source file 2025-12-04T09:17:26.2928188Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/tensor_type.py::is_consistent:0 2025-12-04T09:17:26.2929040Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/tensor_type.py::is_more_precise:0, line 93 <- wrt source file 2025-12-04T09:17:26.2929855Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/tensor_type.py::is_more_precise:0 2025-12-04T09:17:26.2930640Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/traceback.py::annotate:0, line 300 <- wrt source file 2025-12-04T09:17:26.2931663Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/traceback.py::annotate:0 2025-12-04T09:17:26.2932534Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/traceback.py::annotate_fn:0, line 344 <- wrt source file 2025-12-04T09:17:26.2933328Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/traceback.py::annotate_fn:0 2025-12-04T09:17:26.2934211Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/rewriter.py::AST_Rewriter.visit_AnnAssign:0, line 97 <- wrt source file 2025-12-04T09:17:26.2935197Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/rewriter.py::AST_Rewriter.visit_AnnAssign:0 2025-12-04T09:17:26.2936408Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/core.py::reify:0, line 58 <- wrt source file 2025-12-04T09:17:26.2937330Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/core.py::reify:0 2025-12-04T09:17:26.2938256Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/match.py::VarDispatcher:0, line 48 <- wrt source file 2025-12-04T09:17:26.2939223Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/match.py::VarDispatcher:0 2025-12-04T09:17:26.2940385Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/more.py::unifiable:0, line 19 <- wrt source file 2025-12-04T09:17:26.2941354Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/more.py::unifiable:0 2025-12-04T09:17:26.2942336Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/more.py::reify_object:0, line 45 <- wrt source file 2025-12-04T09:17:26.2943276Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/more.py::reify_object:0 2025-12-04T09:17:26.2944194Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/more.py::unify_object:0, line 102 <- wrt source file 2025-12-04T09:17:26.2945458Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/more.py::unify_object:0 2025-12-04T09:17:26.2946510Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/unification_tools.py::merge:0, line 37 <- wrt source file 2025-12-04T09:17:26.2947526Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/unification_tools.py::merge:0 2025-12-04T09:17:26.2948741Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/unification_tools.py::merge_with:0, line 64 <- wrt source file 2025-12-04T09:17:26.2949799Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/unification_tools.py::merge_with:0 2025-12-04T09:17:26.2950792Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/unification_tools.py::valmap:0, line 90 <- wrt source file 2025-12-04T09:17:26.2951867Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/unification_tools.py::valmap:0 2025-12-04T09:17:26.2952854Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/unification_tools.py::keymap:0, line 106 <- wrt source file 2025-12-04T09:17:26.2954103Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/unification_tools.py::keymap:0 2025-12-04T09:17:26.2955103Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/unification_tools.py::itemmap:0, line 122 <- wrt source file 2025-12-04T09:17:26.2956125Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/unification_tools.py::itemmap:0 2025-12-04T09:17:26.2957193Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/unification_tools.py::valfilter:0, line 138 <- wrt source file 2025-12-04T09:17:26.2958424Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/unification_tools.py::valfilter:0 2025-12-04T09:17:26.2959430Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/unification_tools.py::keyfilter:0, line 158 <- wrt source file 2025-12-04T09:17:26.2960474Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/unification_tools.py::keyfilter:0 2025-12-04T09:17:26.2961488Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/unification_tools.py::itemfilter:0, line 178 <- wrt source file 2025-12-04T09:17:26.2962678Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/unification_tools.py::itemfilter:0 2025-12-04T09:17:26.2963672Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/unification_tools.py::assoc:0, line 204 <- wrt source file 2025-12-04T09:17:26.2964897Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/unification_tools.py::assoc:0 2025-12-04T09:17:26.2966040Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/unification_tools.py::dissoc:0, line 221 <- wrt source file 2025-12-04T09:17:26.2967056Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/unification_tools.py::dissoc:0 2025-12-04T09:17:26.2968040Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/unification_tools.py::assoc_in:0, line 247 <- wrt source file 2025-12-04T09:17:26.2969163Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/unification_tools.py::assoc_in:0 2025-12-04T09:17:26.2970162Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/unification_tools.py::update_in:0, line 275 <- wrt source file 2025-12-04T09:17:26.2971199Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/unification_tools.py::update_in:0 2025-12-04T09:17:26.2972192Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/unification_tools.py::get_in:0, line 329 <- wrt source file 2025-12-04T09:17:26.2974989Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/unification_tools.py::get_in:0 2025-12-04T09:17:26.2976116Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/unification_tools.py::groupby:0, line 376 <- wrt source file 2025-12-04T09:17:26.2979152Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/unification_tools.py::groupby:0 2025-12-04T09:17:26.2980184Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/unification_tools.py::first:0, line 417 <- wrt source file 2025-12-04T09:17:26.2982401Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/unification_tools.py::first:0 2025-12-04T09:17:26.2983411Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/utils.py::transitive_get:0, line 15 <- wrt source file 2025-12-04T09:17:26.2986768Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/utils.py::transitive_get:0 2025-12-04T09:17:26.2987793Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/utils.py::_toposort:0, line 42 <- wrt source file 2025-12-04T09:17:26.2988753Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/utils.py::_toposort:0 2025-12-04T09:17:26.2989694Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/utils.py::reverse_dict:0, line 70 <- wrt source file 2025-12-04T09:17:26.2991820Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/utils.py::reverse_dict:0 2025-12-04T09:17:26.2992773Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/utils.py::freeze:0, line 95 <- wrt source file 2025-12-04T09:17:26.2995676Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/utils.py::freeze:0 2025-12-04T09:17:26.2996628Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/variable.py::variables:0, line 67 <- wrt source file 2025-12-04T09:17:26.2998359Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/variable.py::variables:0 2025-12-04T09:17:26.2999455Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/multipledispatch/core.py::dispatch:0, line 28 <- wrt source file 2025-12-04T09:17:26.3002216Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/multipledispatch/core.py::dispatch:0 2025-12-04T09:17:26.3003409Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/multipledispatch/dispatcher.py::Dispatcher:0, line 113 <- wrt source file 2025-12-04T09:17:26.3004756Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/multipledispatch/dispatcher.py::Dispatcher:0 2025-12-04T09:17:26.3006020Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/multipledispatch/dispatcher.py::Dispatcher.register:0, line 138 <- wrt source file 2025-12-04T09:17:26.3007255Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/multipledispatch/dispatcher.py::Dispatcher.register:0 2025-12-04T09:17:26.3008539Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/multipledispatch/dispatcher.py::Dispatcher.add:0, line 191 <- wrt source file 2025-12-04T09:17:26.3009712Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/multipledispatch/dispatcher.py::Dispatcher.add:0 2025-12-04T09:17:26.3010933Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/multipledispatch/dispatcher.py::Dispatcher.dispatch:0, line 305 <- wrt source file 2025-12-04T09:17:26.3012286Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/multipledispatch/dispatcher.py::Dispatcher.dispatch:0 2025-12-04T09:17:26.3013583Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/multipledispatch/dispatcher.py::str_signature:0, line 436 <- wrt source file 2025-12-04T09:17:26.3014750Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/multipledispatch/dispatcher.py::str_signature:0 2025-12-04T09:17:26.3016136Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/multipledispatch/utils.py::expand_tuples:0, line 18 <- wrt source file 2025-12-04T09:17:26.3017321Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/multipledispatch/utils.py::expand_tuples:0 2025-12-04T09:17:26.3018396Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/multipledispatch/utils.py::_toposort:0, line 41 <- wrt source file 2025-12-04T09:17:26.3021224Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/multipledispatch/utils.py::_toposort:0 2025-12-04T09:17:26.3022327Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/multipledispatch/utils.py::reverse_dict:0, line 68 <- wrt source file 2025-12-04T09:17:26.3025173Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/multipledispatch/utils.py::reverse_dict:0 2025-12-04T09:17:26.3026291Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/multipledispatch/utils.py::groupby:0, line 87 <- wrt source file 2025-12-04T09:17:26.3029475Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/multipledispatch/utils.py::groupby:0 2025-12-04T09:17:26.3030667Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/multipledispatch/utils.py::typename:0, line 117 <- wrt source file 2025-12-04T09:17:26.3033777Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/multipledispatch/utils.py::typename:0 2025-12-04T09:17:26.3034890Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/multipledispatch/variadic.py::isvariadic:0, line 47 <- wrt source file 2025-12-04T09:17:26.3036322Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/multipledispatch/variadic.py::isvariadic:0 2025-12-04T09:17:26.3037500Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/multipledispatch/variadic.py::Variadic:0, line 83 <- wrt source file 2025-12-04T09:17:26.3038626Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/multipledispatch/variadic.py::Variadic:0 2025-12-04T09:17:26.3039665Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/passes/graph_drawer.py::FxGraphDrawer.get_dot_graph:0, line 129 <- wrt source file 2025-12-04T09:17:26.3089052Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/passes/graph_drawer.py::FxGraphDrawer.get_dot_graph:0 2025-12-04T09:17:26.3090385Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/passes/shape_prop.py::ShapeProp:0, line 99 <- wrt source file 2025-12-04T09:17:26.3091901Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/passes/shape_prop.py::ShapeProp:0 2025-12-04T09:17:26.3093637Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/passes/split_module.py::split_module:0, line 94 <- wrt source file 2025-12-04T09:17:26.3095212Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/passes/split_module.py::split_module:0 2025-12-04T09:17:26.3096901Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/passes/utils/matcher_with_name_node_map_utils.py::SubgraphMatcherWithNameNodeMap:0, line 51 <- wrt source file 2025-12-04T09:17:26.3098580Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/passes/utils/matcher_with_name_node_map_utils.py::SubgraphMatcherWithNameNodeMap:0 2025-12-04T09:17:26.3099868Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/jit/_check.py::AttributeTypeIsSupportedChecker:0, line 37 <- wrt source file 2025-12-04T09:17:26.3100871Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/jit/_check.py::AttributeTypeIsSupportedChecker:0 2025-12-04T09:17:26.3101761Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/jit/mobile/__init__.py::_load_for_lite_interpreter:0, line 22 <- wrt source file 2025-12-04T09:17:26.3102675Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/jit/mobile/__init__.py::_load_for_lite_interpreter:0 2025-12-04T09:17:26.3103600Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/jit/mobile/__init__.py::_get_mobile_model_contained_types:0, line 125 <- wrt source file 2025-12-04T09:17:26.3104562Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/jit/mobile/__init__.py::_get_mobile_model_contained_types:0 2025-12-04T09:17:26.3105466Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/jit/mobile/__init__.py::_get_model_ops_and_info:0, line 225 <- wrt source file 2025-12-04T09:17:26.3106352Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/jit/mobile/__init__.py::_get_model_ops_and_info:0 2025-12-04T09:17:26.3107284Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/masked/_ops.py::logaddexp:0, line 1538 <- wrt source file 2025-12-04T09:17:26.3114786Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/masked/_ops.py::logaddexp:0 2025-12-04T09:17:26.3115693Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/masked/maskedtensor/core.py::is_masked_tensor:0, line 25 <- wrt source file 2025-12-04T09:17:26.3116730Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/masked/maskedtensor/core.py::is_masked_tensor:0 2025-12-04T09:17:26.3117734Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/functional.py::fractional_max_pool2d_with_indices:0, line 470 <- wrt source file 2025-12-04T09:17:26.3153292Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/functional.py::fractional_max_pool2d_with_indices:0 2025-12-04T09:17:26.3154545Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/functional.py::fractional_max_pool3d_with_indices:0, line 589 <- wrt source file 2025-12-04T09:17:26.3790990Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/functional.py::fractional_max_pool3d_with_indices:0 2025-12-04T09:17:26.3816467Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/functional.py::gumbel_softmax:0, line 2198 <- wrt source file 2025-12-04T09:17:26.3826478Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/functional.py::gumbel_softmax:0 2025-12-04T09:17:26.3827607Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/functional.py::embedding:0, line 2503 <- wrt source file 2025-12-04T09:17:26.3846264Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/functional.py::embedding:0 2025-12-04T09:17:26.3847495Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/functional.py::embedding_bag:0, line 2645 <- wrt source file 2025-12-04T09:17:26.3848652Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/functional.py::embedding_bag:0 2025-12-04T09:17:26.3849723Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/functional.py::ctc_loss:0, line 3087 <- wrt source file 2025-12-04T09:17:26.3862019Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/functional.py::ctc_loss:0 2025-12-04T09:17:26.3863102Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/functional.py::nll_loss:0, line 3157 <- wrt source file 2025-12-04T09:17:26.3868769Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/functional.py::nll_loss:0 2025-12-04T09:17:26.3869871Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/functional.py::cross_entropy:0, line 3476 <- wrt source file 2025-12-04T09:17:26.3877703Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/functional.py::cross_entropy:0 2025-12-04T09:17:26.3878840Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/functional.py::binary_cross_entropy:0, line 3542 <- wrt source file 2025-12-04T09:17:26.3884389Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/functional.py::binary_cross_entropy:0 2025-12-04T09:17:26.3885599Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/functional.py::binary_cross_entropy_with_logits:0, line 3613 <- wrt source file 2025-12-04T09:17:26.3891043Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/functional.py::binary_cross_entropy_with_logits:0 2025-12-04T09:17:26.3892511Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/functional.py::pad:0, line 5387 <- wrt source file 2025-12-04T09:17:26.3902437Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/functional.py::pad:0 2025-12-04T09:17:26.3903584Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/grad.py::conv1d_input:0, line 32 <- wrt source file 2025-12-04T09:17:26.3911291Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/grad.py::conv1d_input:0 2025-12-04T09:17:26.3912678Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/grad.py::conv1d_weight:0, line 79 <- wrt source file 2025-12-04T09:17:26.3916637Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/grad.py::conv1d_weight:0 2025-12-04T09:17:26.3917800Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/grad.py::conv2d_input:0, line 130 <- wrt source file 2025-12-04T09:17:26.3923942Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/grad.py::conv2d_input:0 2025-12-04T09:17:26.3925098Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/grad.py::conv2d_weight:0, line 177 <- wrt source file 2025-12-04T09:17:26.3929214Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/grad.py::conv2d_weight:0 2025-12-04T09:17:26.3930468Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/grad.py::conv3d_input:0, line 228 <- wrt source file 2025-12-04T09:17:26.3956113Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/grad.py::conv3d_input:0 2025-12-04T09:17:26.3957277Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/grad.py::conv3d_weight:0, line 275 <- wrt source file 2025-12-04T09:17:26.3971990Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/grad.py::conv3d_weight:0 2025-12-04T09:17:26.3973221Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/init.py::calculate_gain:0, line 172 <- wrt source file 2025-12-04T09:17:26.3975674Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/init.py::calculate_gain:0 2025-12-04T09:17:26.3976729Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/init.py::uniform_:0, line 231 <- wrt source file 2025-12-04T09:17:26.3979561Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/init.py::uniform_:0 2025-12-04T09:17:26.3980589Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/init.py::normal_:0, line 258 <- wrt source file 2025-12-04T09:17:26.3983406Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/init.py::normal_:0 2025-12-04T09:17:26.3988062Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/init.py::trunc_normal_:0, line 293 <- wrt source file 2025-12-04T09:17:26.3989129Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/init.py::trunc_normal_:0 2025-12-04T09:17:26.3990154Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/init.py::constant_:0, line 307 <- wrt source file 2025-12-04T09:17:26.3991535Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/init.py::constant_:0 2025-12-04T09:17:26.3992501Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/init.py::ones_:0, line 324 <- wrt source file 2025-12-04T09:17:26.3995380Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/init.py::ones_:0 2025-12-04T09:17:26.3996339Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/init.py::zeros_:0, line 337 <- wrt source file 2025-12-04T09:17:26.3998997Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/init.py::zeros_:0 2025-12-04T09:17:26.3999945Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/init.py::eye_:0, line 353 <- wrt source file 2025-12-04T09:17:26.4002883Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/init.py::eye_:0 2025-12-04T09:17:26.4003934Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/init.py::dirac_:0, line 375 <- wrt source file 2025-12-04T09:17:26.4008044Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/init.py::dirac_:0 2025-12-04T09:17:26.4009060Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/init.py::xavier_uniform_:0, line 461 <- wrt source file 2025-12-04T09:17:26.4012192Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/init.py::xavier_uniform_:0 2025-12-04T09:17:26.4013355Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/init.py::xavier_normal_:0, line 493 <- wrt source file 2025-12-04T09:17:26.4016129Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/init.py::xavier_normal_:0 2025-12-04T09:17:26.4017172Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/init.py::kaiming_uniform_:0, line 545 <- wrt source file 2025-12-04T09:17:26.4020265Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/init.py::kaiming_uniform_:0 2025-12-04T09:17:26.4021331Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/init.py::kaiming_normal_:0, line 610 <- wrt source file 2025-12-04T09:17:26.4024141Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/init.py::kaiming_normal_:0 2025-12-04T09:17:26.4025162Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/init.py::orthogonal_:0, line 649 <- wrt source file 2025-12-04T09:17:26.4026182Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/init.py::orthogonal_:0 2025-12-04T09:17:26.4027107Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/init.py::sparse_:0, line 702 <- wrt source file 2025-12-04T09:17:26.4030096Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/init.py::sparse_:0 2025-12-04T09:17:26.4031010Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/attention/__init__.py::sdpa_kernel:0, line 124 <- wrt source file 2025-12-04T09:17:26.4032026Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/attention/__init__.py::sdpa_kernel:0 2025-12-04T09:17:26.4033100Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/attention/_registry.py::register_flash_attention_impl:0, line 47 <- wrt source file 2025-12-04T09:17:26.4034234Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/attention/_registry.py::register_flash_attention_impl:0 2025-12-04T09:17:26.4035342Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/attention/_registry.py::activate_flash_attention_impl:0, line 78 <- wrt source file 2025-12-04T09:17:26.4036464Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/attention/_registry.py::activate_flash_attention_impl:0 2025-12-04T09:17:26.4037466Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/attention/bias.py::CausalBias:0, line 94 <- wrt source file 2025-12-04T09:17:26.4038416Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/attention/bias.py::CausalBias:0 2025-12-04T09:17:26.4039439Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/attention/varlen.py::varlen_attn:0, line 166 <- wrt source file 2025-12-04T09:17:26.4040370Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/attention/varlen.py::varlen_attn:0 2025-12-04T09:17:26.4041187Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::Threshold:0, line 72 <- wrt source file 2025-12-04T09:17:26.4042077Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::Threshold:0 2025-12-04T09:17:26.4042944Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::ReLU:0, line 120 <- wrt source file 2025-12-04T09:17:26.4048278Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::ReLU:0 2025-12-04T09:17:26.4049117Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::RReLU:0, line 185 <- wrt source file 2025-12-04T09:17:26.4053455Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::RReLU:0 2025-12-04T09:17:26.4054317Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::Hardtanh:0, line 247 <- wrt source file 2025-12-04T09:17:26.4058211Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::Hardtanh:0 2025-12-04T09:17:26.4059088Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::ReLU6:0, line 318 <- wrt source file 2025-12-04T09:17:26.4062935Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::ReLU6:0 2025-12-04T09:17:26.4064313Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::Sigmoid:0, line 349 <- wrt source file 2025-12-04T09:17:26.4067319Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::Sigmoid:0 2025-12-04T09:17:26.4068666Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::Hardsigmoid:0, line 384 <- wrt source file 2025-12-04T09:17:26.4071417Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::Hardsigmoid:0 2025-12-04T09:17:26.4072434Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::Tanh:0, line 420 <- wrt source file 2025-12-04T09:17:26.4075616Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::Tanh:0 2025-12-04T09:17:26.4076590Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::SiLU:0, line 456 <- wrt source file 2025-12-04T09:17:26.4079902Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::SiLU:0 2025-12-04T09:17:26.4080867Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::Mish:0, line 501 <- wrt source file 2025-12-04T09:17:26.4084229Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::Mish:0 2025-12-04T09:17:26.4085215Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::Hardswish:0, line 552 <- wrt source file 2025-12-04T09:17:26.4088356Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::Hardswish:0 2025-12-04T09:17:26.4089354Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::ELU:0, line 598 <- wrt source file 2025-12-04T09:17:26.4092969Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::ELU:0 2025-12-04T09:17:26.4093824Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::CELU:0, line 646 <- wrt source file 2025-12-04T09:17:26.4097164Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::CELU:0 2025-12-04T09:17:26.4098089Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::SELU:0, line 705 <- wrt source file 2025-12-04T09:17:26.4101481Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::SELU:0 2025-12-04T09:17:26.4102527Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::GLU:0, line 751 <- wrt source file 2025-12-04T09:17:26.4105813Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::GLU:0 2025-12-04T09:17:26.4106841Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::GELU:0, line 799 <- wrt source file 2025-12-04T09:17:26.4112610Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::GELU:0 2025-12-04T09:17:26.4113874Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::Hardshrink:0, line 848 <- wrt source file 2025-12-04T09:17:26.4117316Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::Hardshrink:0 2025-12-04T09:17:26.4118463Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::LeakyReLU:0, line 903 <- wrt source file 2025-12-04T09:17:26.4121662Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::LeakyReLU:0 2025-12-04T09:17:26.4122827Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::LogSigmoid:0, line 945 <- wrt source file 2025-12-04T09:17:26.4125876Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::LogSigmoid:0 2025-12-04T09:17:26.4127026Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::Softplus:0, line 981 <- wrt source file 2025-12-04T09:17:26.4130500Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::Softplus:0 2025-12-04T09:17:26.4131634Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::Softshrink:0, line 1030 <- wrt source file 2025-12-04T09:17:26.4134822Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::Softshrink:0 2025-12-04T09:17:26.4136033Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::MultiheadAttention:0, line 1148 <- wrt source file 2025-12-04T09:17:26.4137293Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::MultiheadAttention:0 2025-12-04T09:17:26.4138446Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::PReLU:0, line 1613 <- wrt source file 2025-12-04T09:17:26.4140765Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::PReLU:0 2025-12-04T09:17:26.4141883Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::Softsign:0, line 1664 <- wrt source file 2025-12-04T09:17:26.4145114Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::Softsign:0 2025-12-04T09:17:26.4146240Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::Tanhshrink:0, line 1690 <- wrt source file 2025-12-04T09:17:26.4149166Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::Tanhshrink:0 2025-12-04T09:17:26.4150360Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::Softmin:0, line 1728 <- wrt source file 2025-12-04T09:17:26.4153514Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::Softmin:0 2025-12-04T09:17:26.4154637Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::Softmax:0, line 1792 <- wrt source file 2025-12-04T09:17:26.4157627Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::Softmax:0 2025-12-04T09:17:26.4158750Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::Softmax2d:0, line 1839 <- wrt source file 2025-12-04T09:17:26.4162167Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::Softmax2d:0 2025-12-04T09:17:26.4163417Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::LogSoftmax:0, line 1878 <- wrt source file 2025-12-04T09:17:26.4167353Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::LogSoftmax:0 2025-12-04T09:17:26.4168509Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/batchnorm.py::BatchNorm1d:0, line 341 <- wrt source file 2025-12-04T09:17:26.4175299Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/batchnorm.py::BatchNorm1d:0 2025-12-04T09:17:26.4176449Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/batchnorm.py::BatchNorm2d:0, line 453 <- wrt source file 2025-12-04T09:17:26.4307341Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/batchnorm.py::BatchNorm2d:0 2025-12-04T09:17:26.4308678Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/batchnorm.py::BatchNorm3d:0, line 565 <- wrt source file 2025-12-04T09:17:26.5987045Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/batchnorm.py::BatchNorm3d:0 2025-12-04T09:17:26.6182941Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/batchnorm.py::SyncBatchNorm:0, line 690 <- wrt source file 2025-12-04T09:17:26.6185103Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/batchnorm.py::SyncBatchNorm:0 2025-12-04T09:17:26.6186911Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/batchnorm.py::SyncBatchNorm.convert_sync_batchnorm:0, line 857 <- wrt source file 2025-12-04T09:17:26.6188877Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/batchnorm.py::SyncBatchNorm.convert_sync_batchnorm:0 2025-12-04T09:17:26.6190613Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/channelshuffle.py::ChannelShuffle:0, line 21 <- wrt source file 2025-12-04T09:17:26.6213278Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/channelshuffle.py::ChannelShuffle:0 2025-12-04T09:17:26.6214652Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/container.py::Sequential:0, line 81 <- wrt source file 2025-12-04T09:17:26.6216328Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/container.py::Sequential:0 2025-12-04T09:17:26.6217677Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/container.py::Sequential.append:0, line 263 <- wrt source file 2025-12-04T09:17:26.6225113Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/container.py::Sequential.append:0 2025-12-04T09:17:26.6226669Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/container.py::Sequential.insert:0, line 286 <- wrt source file 2025-12-04T09:17:26.6233670Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/container.py::Sequential.insert:0 2025-12-04T09:17:26.6235065Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/container.py::Sequential.extend:0, line 317 <- wrt source file 2025-12-04T09:17:26.6243856Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/container.py::Sequential.extend:0 2025-12-04T09:17:26.6245199Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/container.py::ModuleList:0, line 346 <- wrt source file 2025-12-04T09:17:26.6246514Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/container.py::ModuleList:0 2025-12-04T09:17:26.6247928Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/container.py::ModuleDict:0, line 529 <- wrt source file 2025-12-04T09:17:26.6249268Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/container.py::ModuleDict:0 2025-12-04T09:17:26.6250565Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/container.py::ParameterList:0, line 661 <- wrt source file 2025-12-04T09:17:26.6251918Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/container.py::ParameterList:0 2025-12-04T09:17:26.6253395Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/container.py::ParameterDict:0, line 819 <- wrt source file 2025-12-04T09:17:26.6254755Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/container.py::ParameterDict:0 2025-12-04T09:17:26.6256103Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/distance.py::PairwiseDistance:0, line 38 <- wrt source file 2025-12-04T09:17:26.6257489Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/distance.py::PairwiseDistance:0 2025-12-04T09:17:26.6258825Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/distance.py::CosineSimilarity:0, line 81 <- wrt source file 2025-12-04T09:17:26.6263787Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/distance.py::CosineSimilarity:0 2025-12-04T09:17:26.6264654Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/dropout.py::Dropout:0, line 60 <- wrt source file 2025-12-04T09:17:26.6268798Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/dropout.py::Dropout:0 2025-12-04T09:17:26.6269629Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/dropout.py::Dropout1d:0, line 108 <- wrt source file 2025-12-04T09:17:26.6273954Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/dropout.py::Dropout1d:0 2025-12-04T09:17:26.6274969Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/dropout.py::Dropout2d:0, line 163 <- wrt source file 2025-12-04T09:17:26.6290426Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/dropout.py::Dropout2d:0 2025-12-04T09:17:26.6291817Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/dropout.py::Dropout3d:0, line 211 <- wrt source file 2025-12-04T09:17:26.6340120Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/dropout.py::Dropout3d:0 2025-12-04T09:17:26.6341396Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/dropout.py::AlphaDropout:0, line 257 <- wrt source file 2025-12-04T09:17:26.6345309Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/dropout.py::AlphaDropout:0 2025-12-04T09:17:26.6346497Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/dropout.py::FeatureAlphaDropout:0, line 309 <- wrt source file 2025-12-04T09:17:26.6396136Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/dropout.py::FeatureAlphaDropout:0 2025-12-04T09:17:26.6397450Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/flatten.py::Flatten:0, line 29 <- wrt source file 2025-12-04T09:17:26.6403036Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/flatten.py::Flatten:0 2025-12-04T09:17:26.6404285Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/flatten.py::Unflatten:0, line 86 <- wrt source file 2025-12-04T09:17:26.6418494Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/flatten.py::Unflatten:0 2025-12-04T09:17:26.6419635Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/fold.py::Fold:0, line 224 <- wrt source file 2025-12-04T09:17:26.6424361Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/fold.py::Fold:0 2025-12-04T09:17:26.6425541Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/fold.py::Unfold:0, line 395 <- wrt source file 2025-12-04T09:17:26.6438908Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/fold.py::Unfold:0 2025-12-04T09:17:26.6440182Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/instancenorm.py::InstanceNorm1d:0, line 188 <- wrt source file 2025-12-04T09:17:26.6451924Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/instancenorm.py::InstanceNorm1d:0 2025-12-04T09:17:26.6453463Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/instancenorm.py::InstanceNorm2d:0, line 304 <- wrt source file 2025-12-04T09:17:26.6575251Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/instancenorm.py::InstanceNorm2d:0 2025-12-04T09:17:26.6576567Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/instancenorm.py::InstanceNorm3d:0, line 420 <- wrt source file 2025-12-04T09:17:26.8247036Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/instancenorm.py::InstanceNorm3d:0 2025-12-04T09:17:26.8436624Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/lazy.py::LazyModuleMixin:0, line 77 <- wrt source file 2025-12-04T09:17:26.8440840Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/lazy.py::LazyModuleMixin:0 2025-12-04T09:17:26.8441773Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/linear.py::Identity:0, line 34 <- wrt source file 2025-12-04T09:17:26.8448994Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/linear.py::Identity:0 2025-12-04T09:17:26.8450257Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/linear.py::Linear:0, line 83 <- wrt source file 2025-12-04T09:17:26.8459082Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/linear.py::Linear:0 2025-12-04T09:17:26.8460407Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/linear.py::Bilinear:0, line 191 <- wrt source file 2025-12-04T09:17:26.8475746Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/linear.py::Bilinear:0 2025-12-04T09:17:26.8476934Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::L1Loss:0, line 116 <- wrt source file 2025-12-04T09:17:26.8483781Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::L1Loss:0 2025-12-04T09:17:26.8484829Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::NLLLoss:0, line 213 <- wrt source file 2025-12-04T09:17:26.8508819Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::NLLLoss:0 2025-12-04T09:17:26.8509922Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::PoissonNLLLoss:0, line 327 <- wrt source file 2025-12-04T09:17:26.8516223Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::PoissonNLLLoss:0 2025-12-04T09:17:26.8517365Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::GaussianNLLLoss:0, line 416 <- wrt source file 2025-12-04T09:17:26.8530443Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::GaussianNLLLoss:0 2025-12-04T09:17:26.8531546Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::KLDivLoss:0, line 531 <- wrt source file 2025-12-04T09:17:26.8539265Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::KLDivLoss:0 2025-12-04T09:17:26.8540312Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::MSELoss:0, line 613 <- wrt source file 2025-12-04T09:17:26.8545737Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::MSELoss:0 2025-12-04T09:17:26.8546796Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::BCELoss:0, line 696 <- wrt source file 2025-12-04T09:17:26.8552750Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::BCELoss:0 2025-12-04T09:17:26.8553878Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::BCEWithLogitsLoss:0, line 762 <- wrt source file 2025-12-04T09:17:26.8565024Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::BCEWithLogitsLoss:0 2025-12-04T09:17:26.8566174Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::BCEWithLogitsLoss:1, line 810 <- wrt source file 2025-12-04T09:17:26.8571461Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::BCEWithLogitsLoss:1 2025-12-04T09:17:26.8572728Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::MultiLabelMarginLoss:0, line 958 <- wrt source file 2025-12-04T09:17:26.8580516Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::MultiLabelMarginLoss:0 2025-12-04T09:17:26.8581693Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::CrossEntropyLoss:0, line 1284 <- wrt source file 2025-12-04T09:17:26.8589819Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::CrossEntropyLoss:0 2025-12-04T09:17:26.8590978Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::CrossEntropyLoss:1, line 1311 <- wrt source file 2025-12-04T09:17:26.8592950Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::CrossEntropyLoss:1 2025-12-04T09:17:26.8594200Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::CosineEmbeddingLoss:0, line 1464 <- wrt source file 2025-12-04T09:17:26.8602087Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::CosineEmbeddingLoss:0 2025-12-04T09:17:26.8603266Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::MarginRankingLoss:0, line 1531 <- wrt source file 2025-12-04T09:17:26.8609460Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::MarginRankingLoss:0 2025-12-04T09:17:26.8610592Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::MultiMarginLoss:0, line 1612 <- wrt source file 2025-12-04T09:17:26.8618435Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::MultiMarginLoss:0 2025-12-04T09:17:26.8619706Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::TripletMarginLoss:0, line 1714 <- wrt source file 2025-12-04T09:17:26.8629049Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::TripletMarginLoss:0 2025-12-04T09:17:26.8630282Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::TripletMarginWithDistanceLoss:0, line 1827 <- wrt source file 2025-12-04T09:17:26.8646735Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::TripletMarginWithDistanceLoss:0 2025-12-04T09:17:26.8647894Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::CTCLoss:0, line 1959 <- wrt source file 2025-12-04T09:17:26.8670827Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::CTCLoss:0 2025-12-04T09:17:26.8672005Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py::Module.register_buffer:0, line 554 <- wrt source file 2025-12-04T09:17:26.8673236Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py::Module.register_buffer:0 2025-12-04T09:17:26.8674398Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py::Module.apply:0, line 1048 <- wrt source file 2025-12-04T09:17:26.8686622Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py::Module.apply:0 2025-12-04T09:17:26.8687769Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py::Module.to:0, line 1299 <- wrt source file 2025-12-04T09:17:26.8695984Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py::Module.to:0 2025-12-04T09:17:26.8697147Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py::Module.state_dict:0, line 2232 <- wrt source file 2025-12-04T09:17:26.8698335Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py::Module.state_dict:0 2025-12-04T09:17:26.8699502Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py::Module.parameters:0, line 2678 <- wrt source file 2025-12-04T09:17:26.8700820Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py::Module.parameters:0 2025-12-04T09:17:26.8702002Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py::Module.named_parameters:0, line 2706 <- wrt source file 2025-12-04T09:17:26.8703337Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py::Module.named_parameters:0 2025-12-04T09:17:26.8704578Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py::Module.buffers:0, line 2733 <- wrt source file 2025-12-04T09:17:26.8705721Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py::Module.buffers:0 2025-12-04T09:17:26.8706869Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py::Module.named_buffers:0, line 2760 <- wrt source file 2025-12-04T09:17:26.8708067Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py::Module.named_buffers:0 2025-12-04T09:17:26.8709238Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py::Module.named_children:0, line 2791 <- wrt source file 2025-12-04T09:17:26.8710450Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py::Module.named_children:0 2025-12-04T09:17:26.8711671Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py::Module.modules:0, line 2815 <- wrt source file 2025-12-04T09:17:26.8712819Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py::Module.modules:0 2025-12-04T09:17:26.8713958Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py::Module.named_modules:0, line 2853 <- wrt source file 2025-12-04T09:17:26.8715152Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py::Module.named_modules:0 2025-12-04T09:17:26.8716378Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/normalization.py::LocalResponseNorm:0, line 38 <- wrt source file 2025-12-04T09:17:26.8740329Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/normalization.py::LocalResponseNorm:0 2025-12-04T09:17:26.8741564Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/normalization.py::LayerNorm:0, line 163 <- wrt source file 2025-12-04T09:17:26.8749839Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/normalization.py::LayerNorm:0 2025-12-04T09:17:26.8751028Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/normalization.py::GroupNorm:0, line 274 <- wrt source file 2025-12-04T09:17:26.8757920Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/normalization.py::GroupNorm:0 2025-12-04T09:17:26.8759074Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/normalization.py::RMSNorm:0, line 369 <- wrt source file 2025-12-04T09:17:26.8763927Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/normalization.py::RMSNorm:0 2025-12-04T09:17:26.8765062Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/padding.py::CircularPad1d:0, line 70 <- wrt source file 2025-12-04T09:17:26.8771228Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/padding.py::CircularPad1d:0 2025-12-04T09:17:26.8772632Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/padding.py::CircularPad2d:0, line 123 <- wrt source file 2025-12-04T09:17:26.8792011Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/padding.py::CircularPad2d:0 2025-12-04T09:17:26.8793211Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/padding.py::CircularPad3d:0, line 189 <- wrt source file 2025-12-04T09:17:27.3631168Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/padding.py::CircularPad3d:0 2025-12-04T09:17:27.4017126Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/padding.py::ConstantPad1d:0, line 244 <- wrt source file 2025-12-04T09:17:27.4029159Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/padding.py::ConstantPad1d:0 2025-12-04T09:17:27.4030513Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/padding.py::ConstantPad2d:0, line 298 <- wrt source file 2025-12-04T09:17:27.4036607Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/padding.py::ConstantPad2d:0 2025-12-04T09:17:27.4038070Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/padding.py::ConstantPad3d:0, line 355 <- wrt source file 2025-12-04T09:17:27.4055765Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/padding.py::ConstantPad3d:0 2025-12-04T09:17:27.4057603Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/padding.py::ReflectionPad1d:0, line 401 <- wrt source file 2025-12-04T09:17:27.4063135Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/padding.py::ReflectionPad1d:0 2025-12-04T09:17:27.4064388Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/padding.py::ReflectionPad2d:0, line 446 <- wrt source file 2025-12-04T09:17:27.4070582Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/padding.py::ReflectionPad2d:0 2025-12-04T09:17:27.4071773Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/padding.py::ReflectionPad3d:0, line 505 <- wrt source file 2025-12-04T09:17:27.4075137Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/padding.py::ReflectionPad3d:0 2025-12-04T09:17:27.4076311Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/padding.py::ReplicationPad1d:0, line 565 <- wrt source file 2025-12-04T09:17:27.4081572Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/padding.py::ReplicationPad1d:0 2025-12-04T09:17:27.4082776Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/padding.py::ReplicationPad2d:0, line 610 <- wrt source file 2025-12-04T09:17:27.4088567Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/padding.py::ReplicationPad2d:0 2025-12-04T09:17:27.4089755Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/padding.py::ReplicationPad3d:0, line 669 <- wrt source file 2025-12-04T09:17:27.7711064Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/padding.py::ReplicationPad3d:0 2025-12-04T09:17:27.8090331Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/padding.py::ZeroPad1d:0, line 704 <- wrt source file 2025-12-04T09:17:27.8101502Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/padding.py::ZeroPad1d:0 2025-12-04T09:17:27.8103126Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/padding.py::ZeroPad2d:0, line 762 <- wrt source file 2025-12-04T09:17:27.8107889Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/padding.py::ZeroPad2d:0 2025-12-04T09:17:27.8109164Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/padding.py::ZeroPad3d:0, line 824 <- wrt source file 2025-12-04T09:17:27.8126206Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/padding.py::ZeroPad3d:0 2025-12-04T09:17:27.8127816Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pixelshuffle.py::PixelShuffle:0, line 40 <- wrt source file 2025-12-04T09:17:27.8132111Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pixelshuffle.py::PixelShuffle:0 2025-12-04T09:17:27.8133540Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pixelshuffle.py::PixelUnshuffle:0, line 99 <- wrt source file 2025-12-04T09:17:27.8137831Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pixelshuffle.py::PixelUnshuffle:0 2025-12-04T09:17:27.8139032Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::MaxPool1d:0, line 129 <- wrt source file 2025-12-04T09:17:27.8143220Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::MaxPool1d:0 2025-12-04T09:17:27.8144464Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::MaxPool2d:0, line 207 <- wrt source file 2025-12-04T09:17:27.8174337Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::MaxPool2d:0 2025-12-04T09:17:27.8175458Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::MaxPool3d:0, line 291 <- wrt source file 2025-12-04T09:17:27.9610442Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::MaxPool3d:0 2025-12-04T09:17:27.9676046Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::MaxUnpool1d:0, line 366 <- wrt source file 2025-12-04T09:17:27.9691224Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::MaxUnpool1d:0 2025-12-04T09:17:27.9692522Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::MaxUnpool2d:0, line 452 <- wrt source file 2025-12-04T09:17:27.9712744Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::MaxUnpool2d:0 2025-12-04T09:17:27.9713894Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::MaxUnpool3d:0, line 550 <- wrt source file 2025-12-04T09:17:28.0242638Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::MaxUnpool3d:0 2025-12-04T09:17:28.0243800Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::AvgPool1d:0, line 642 <- wrt source file 2025-12-04T09:17:28.0254250Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::AvgPool1d:0 2025-12-04T09:17:28.0255409Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::AvgPool2d:0, line 738 <- wrt source file 2025-12-04T09:17:28.0280878Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::AvgPool2d:0 2025-12-04T09:17:28.0282016Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::AvgPool3d:0, line 855 <- wrt source file 2025-12-04T09:17:28.1423755Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::AvgPool3d:0 2025-12-04T09:17:28.1491110Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::FractionalMaxPool2d:0, line 946 <- wrt source file 2025-12-04T09:17:28.1524722Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::FractionalMaxPool2d:0 2025-12-04T09:17:28.1526293Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::FractionalMaxPool3d:0, line 1033 <- wrt source file 2025-12-04T09:17:28.1902667Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::FractionalMaxPool3d:0 2025-12-04T09:17:28.1903874Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::LPPool1d:0, line 1156 <- wrt source file 2025-12-04T09:17:28.1910912Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::LPPool1d:0 2025-12-04T09:17:28.1912031Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::LPPool2d:0, line 1212 <- wrt source file 2025-12-04T09:17:28.1944165Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::LPPool2d:0 2025-12-04T09:17:28.1945536Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::LPPool3d:0, line 1276 <- wrt source file 2025-12-04T09:17:28.3416305Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::LPPool3d:0 2025-12-04T09:17:28.3482575Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::AdaptiveMaxPool1d:0, line 1332 <- wrt source file 2025-12-04T09:17:28.3491195Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::AdaptiveMaxPool1d:0 2025-12-04T09:17:28.3492831Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::AdaptiveMaxPool2d:0, line 1367 <- wrt source file 2025-12-04T09:17:28.3500167Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::AdaptiveMaxPool2d:0 2025-12-04T09:17:28.3501448Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::AdaptiveMaxPool3d:0, line 1411 <- wrt source file 2025-12-04T09:17:28.3591703Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::AdaptiveMaxPool3d:0 2025-12-04T09:17:28.3592915Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::AdaptiveAvgPool1d:0, line 1459 <- wrt source file 2025-12-04T09:17:28.3597012Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::AdaptiveAvgPool1d:0 2025-12-04T09:17:28.3598243Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::AdaptiveAvgPool2d:0, line 1493 <- wrt source file 2025-12-04T09:17:28.3604486Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::AdaptiveAvgPool2d:0 2025-12-04T09:17:28.3605712Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::AdaptiveAvgPool3d:0, line 1533 <- wrt source file 2025-12-04T09:17:28.3617758Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::AdaptiveAvgPool3d:0 2025-12-04T09:17:28.3618861Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/rnn.py::RNN:0, line 598 <- wrt source file 2025-12-04T09:17:28.3629718Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/rnn.py::RNN:0 2025-12-04T09:17:28.3630775Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/rnn.py::LSTM:0, line 963 <- wrt source file 2025-12-04T09:17:28.3785292Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/rnn.py::LSTM:0 2025-12-04T09:17:28.3786772Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/rnn.py::GRU:0, line 1305 <- wrt source file 2025-12-04T09:17:28.3799988Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/rnn.py::GRU:0 2025-12-04T09:17:28.3801006Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/rnn.py::RNNCell:0, line 1561 <- wrt source file 2025-12-04T09:17:28.3810123Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/rnn.py::RNNCell:0 2025-12-04T09:17:28.3811196Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/rnn.py::LSTMCell:0, line 1683 <- wrt source file 2025-12-04T09:17:28.3819931Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/rnn.py::LSTMCell:0 2025-12-04T09:17:28.3820989Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/rnn.py::GRUCell:0, line 1797 <- wrt source file 2025-12-04T09:17:28.3831688Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/rnn.py::GRUCell:0 2025-12-04T09:17:28.3832785Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/sparse.py::Embedding:0, line 71 <- wrt source file 2025-12-04T09:17:28.3846053Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/sparse.py::Embedding:0 2025-12-04T09:17:28.3847239Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/sparse.py::Embedding.from_pretrained:0, line 243 <- wrt source file 2025-12-04T09:17:28.3852040Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/sparse.py::Embedding.from_pretrained:0 2025-12-04T09:17:28.3853342Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/sparse.py::EmbeddingBag:0, line 324 <- wrt source file 2025-12-04T09:17:28.3867165Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/sparse.py::EmbeddingBag:0 2025-12-04T09:17:28.3868393Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/sparse.py::EmbeddingBag.from_pretrained:0, line 523 <- wrt source file 2025-12-04T09:17:28.3874070Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/sparse.py::EmbeddingBag.from_pretrained:0 2025-12-04T09:17:28.3875288Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/transformer.py::Transformer:0, line 91 <- wrt source file 2025-12-04T09:17:29.0201519Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/transformer.py::Transformer:0 2025-12-04T09:17:29.0214199Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/transformer.py::Transformer.forward:0, line 267 <- wrt source file 2025-12-04T09:17:29.0215567Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/transformer.py::Transformer.forward:0 2025-12-04T09:17:29.0216819Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/transformer.py::TransformerEncoder:0, line 345 <- wrt source file 2025-12-04T09:17:29.1075800Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/transformer.py::TransformerEncoder:0 2025-12-04T09:17:29.1079150Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/transformer.py::TransformerDecoder:0, line 578 <- wrt source file 2025-12-04T09:17:29.2975052Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/transformer.py::TransformerDecoder:0 2025-12-04T09:17:29.2981827Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/transformer.py::TransformerEncoderLayer:0, line 702 <- wrt source file 2025-12-04T09:17:29.3212978Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/transformer.py::TransformerEncoderLayer:0 2025-12-04T09:17:29.3214378Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/transformer.py::TransformerDecoderLayer:0, line 1014 <- wrt source file 2025-12-04T09:17:29.3647295Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/transformer.py::TransformerDecoderLayer:0 2025-12-04T09:17:29.3648551Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/upsampling.py::Upsample:0, line 77 <- wrt source file 2025-12-04T09:17:29.3674626Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/upsampling.py::Upsample:0 2025-12-04T09:17:29.3676197Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/upsampling.py::UpsamplingNearest2d:0, line 229 <- wrt source file 2025-12-04T09:17:29.3687326Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/upsampling.py::UpsamplingNearest2d:0 2025-12-04T09:17:29.3688648Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/upsampling.py::UpsamplingBilinear2d:0, line 279 <- wrt source file 2025-12-04T09:17:29.3696568Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/upsampling.py::UpsamplingBilinear2d:0 2025-12-04T09:17:29.3697652Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py::DataParallel:0, line 128 <- wrt source file 2025-12-04T09:17:29.3698609Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py::DataParallel:0 2025-12-04T09:17:29.3699597Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/parallel/distributed.py::DistributedDataParallel:0, line 644 <- wrt source file 2025-12-04T09:17:29.3700587Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/parallel/distributed.py::DistributedDataParallel:0 2025-12-04T09:17:29.3701586Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/parallel/distributed.py::DistributedDataParallel.no_sync:0, line 1451 <- wrt source file 2025-12-04T09:17:29.3702623Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/parallel/distributed.py::DistributedDataParallel.no_sync:0 2025-12-04T09:17:29.3703619Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/parallel/distributed.py::DistributedDataParallel.join:0, line 1838 <- wrt source file 2025-12-04T09:17:29.3704639Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/parallel/distributed.py::DistributedDataParallel.join:0 2025-12-04T09:17:29.3705684Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/parallel/distributed.py::DistributedDataParallel.register_comm_hook:0, line 2004 <- wrt source file 2025-12-04T09:17:29.3706780Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/parallel/distributed.py::DistributedDataParallel.register_comm_hook:0 2025-12-04T09:17:29.3708046Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/parallel/distributed.py::DistributedDataParallel.register_comm_hook:1, line 2014 <- wrt source file 2025-12-04T09:17:29.3709135Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/parallel/distributed.py::DistributedDataParallel.register_comm_hook:1 2025-12-04T09:17:29.3710340Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/parallel/distributed.py::DistributedDataParallel._register_builtin_comm_hook:0, line 2049 <- wrt source file 2025-12-04T09:17:29.3711564Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/parallel/distributed.py::DistributedDataParallel._register_builtin_comm_hook:0 2025-12-04T09:17:29.3712669Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/parallel/distributed.py::DistributedDataParallel._register_fused_optim:0, line 2107 <- wrt source file 2025-12-04T09:17:29.3713765Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/parallel/distributed.py::DistributedDataParallel._register_fused_optim:0 2025-12-04T09:17:29.3714755Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/_per_sample_grad.py::call_for_per_sample_grads:0, line 35 <- wrt source file 2025-12-04T09:17:29.3715691Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/_per_sample_grad.py::call_for_per_sample_grads:0 2025-12-04T09:17:29.3716576Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/init.py::skip_init:0, line 33 <- wrt source file 2025-12-04T09:17:29.3717355Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/init.py::skip_init:0 2025-12-04T09:17:29.3718222Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/memory_format.py::convert_conv2d_weight_memory_format:0, line 64 <- wrt source file 2025-12-04T09:17:29.3719281Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/memory_format.py::convert_conv2d_weight_memory_format:0 2025-12-04T09:17:29.3720284Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/memory_format.py::convert_conv3d_weight_memory_format:0, line 143 <- wrt source file 2025-12-04T09:17:29.3721308Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/memory_format.py::convert_conv3d_weight_memory_format:0 2025-12-04T09:17:29.3722242Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/parametrizations.py::orthogonal:0, line 267 <- wrt source file 2025-12-04T09:17:29.3723130Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/parametrizations.py::orthogonal:0 2025-12-04T09:17:29.3723997Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/parametrizations.py::weight_norm:0, line 362 <- wrt source file 2025-12-04T09:17:29.3729785Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/parametrizations.py::weight_norm:0 2025-12-04T09:17:29.3730777Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/parametrizations.py::spectral_norm:0, line 593 <- wrt source file 2025-12-04T09:17:29.3731963Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/parametrizations.py::spectral_norm:0 2025-12-04T09:17:29.3733108Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/prune.py::identity:0, line 852 <- wrt source file 2025-12-04T09:17:29.3734089Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/prune.py::identity:0 2025-12-04T09:17:29.3735171Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/prune.py::random_unstructured:0, line 888 <- wrt source file 2025-12-04T09:17:29.3736240Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/prune.py::random_unstructured:0 2025-12-04T09:17:29.3737250Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/prune.py::l1_unstructured:0, line 931 <- wrt source file 2025-12-04T09:17:29.3738342Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/prune.py::l1_unstructured:0 2025-12-04T09:17:29.3739411Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/prune.py::random_structured:0, line 971 <- wrt source file 2025-12-04T09:17:29.3740451Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/prune.py::random_structured:0 2025-12-04T09:17:29.3741367Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/prune.py::ln_structured:0, line 1017 <- wrt source file 2025-12-04T09:17:29.3748176Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/prune.py::ln_structured:0 2025-12-04T09:17:29.3749049Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/prune.py::global_unstructured:0, line 1072 <- wrt source file 2025-12-04T09:17:29.3764605Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/prune.py::global_unstructured:0 2025-12-04T09:17:29.3765670Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/prune.py::custom_from_mask:0, line 1175 <- wrt source file 2025-12-04T09:17:29.3773849Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/prune.py::custom_from_mask:0 2025-12-04T09:17:29.3774877Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/prune.py::remove:0, line 1203 <- wrt source file 2025-12-04T09:17:29.3780454Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/prune.py::remove:0 2025-12-04T09:17:29.3781458Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/prune.py::is_pruned:0, line 1231 <- wrt source file 2025-12-04T09:17:29.3788864Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/prune.py::is_pruned:0 2025-12-04T09:17:29.3789891Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/rnn.py::pad_packed_sequence:0, line 359 <- wrt source file 2025-12-04T09:17:29.3804712Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/rnn.py::pad_packed_sequence:0 2025-12-04T09:17:29.3805732Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/rnn.py::pad_sequence:0, line 439 <- wrt source file 2025-12-04T09:17:29.3810636Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/rnn.py::pad_sequence:0 2025-12-04T09:17:29.3811638Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/rnn.py::unpad_sequence:0, line 500 <- wrt source file 2025-12-04T09:17:29.3823798Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/rnn.py::unpad_sequence:0 2025-12-04T09:17:29.3824895Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/rnn.py::pack_sequence:0, line 556 <- wrt source file 2025-12-04T09:17:29.3831659Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/rnn.py::pack_sequence:0 2025-12-04T09:17:29.3832760Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/rnn.py::unpack_sequence:0, line 584 <- wrt source file 2025-12-04T09:17:29.3849702Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/rnn.py::unpack_sequence:0 2025-12-04T09:17:29.3850831Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/spectral_norm.py::spectral_norm:0, line 314 <- wrt source file 2025-12-04T09:17:29.3857239Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/spectral_norm.py::spectral_norm:0 2025-12-04T09:17:29.3858631Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/spectral_norm.py::remove_spectral_norm:0, line 347 <- wrt source file 2025-12-04T09:17:29.3864015Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/spectral_norm.py::remove_spectral_norm:0 2025-12-04T09:17:29.3865227Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/stateless.py::functional_call:0, line 193 <- wrt source file 2025-12-04T09:17:29.3868496Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/stateless.py::functional_call:0 2025-12-04T09:17:29.3869648Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/weight_norm.py::weight_norm:0, line 134 <- wrt source file 2025-12-04T09:17:29.3875974Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/weight_norm.py::weight_norm:0 2025-12-04T09:17:29.3877233Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/weight_norm.py::remove_weight_norm:0, line 156 <- wrt source file 2025-12-04T09:17:29.3881631Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/weight_norm.py::remove_weight_norm:0 2025-12-04T09:17:29.3882862Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/_expanded_weights/conv_utils.py::unfold3d:0, line 315 <- wrt source file 2025-12-04T09:17:29.3884145Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/_expanded_weights/conv_utils.py::unfold3d:0 2025-12-04T09:17:29.3885537Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/_expanded_weights/expanded_weights_utils.py::sum_over_all_but_batch_and_last_n:0, line 178 <- wrt source file 2025-12-04T09:17:29.3888513Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/_expanded_weights/expanded_weights_utils.py::sum_over_all_but_batch_and_last_n:0 2025-12-04T09:17:29.3889838Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::LambdaLR:0, line 357 <- wrt source file 2025-12-04T09:17:29.3890941Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::LambdaLR:0 2025-12-04T09:17:29.3892058Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::MultiplicativeLR:0, line 483 <- wrt source file 2025-12-04T09:17:29.3893326Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::MultiplicativeLR:0 2025-12-04T09:17:29.3894422Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::StepLR:0, line 608 <- wrt source file 2025-12-04T09:17:29.3895504Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::StepLR:0 2025-12-04T09:17:29.3896575Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::MultiStepLR:0, line 695 <- wrt source file 2025-12-04T09:17:29.3897697Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::MultiStepLR:0 2025-12-04T09:17:29.3898888Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::ConstantLR:0, line 791 <- wrt source file 2025-12-04T09:17:29.3899995Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::ConstantLR:0 2025-12-04T09:17:29.3901055Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::LinearLR:0, line 898 <- wrt source file 2025-12-04T09:17:29.3902220Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::LinearLR:0 2025-12-04T09:17:29.3903418Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::ExponentialLR:0, line 1020 <- wrt source file 2025-12-04T09:17:29.3904578Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::ExponentialLR:0 2025-12-04T09:17:29.3905688Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::SequentialLR:0, line 1097 <- wrt source file 2025-12-04T09:17:29.3906821Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::SequentialLR:0 2025-12-04T09:17:29.3907924Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::PolynomialLR:0, line 1249 <- wrt source file 2025-12-04T09:17:29.3909117Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::PolynomialLR:0 2025-12-04T09:17:29.3910273Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::CosineAnnealingLR:0, line 1378 <- wrt source file 2025-12-04T09:17:29.3911470Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::CosineAnnealingLR:0 2025-12-04T09:17:29.3912628Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::ChainedScheduler:0, line 1490 <- wrt source file 2025-12-04T09:17:29.3913810Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::ChainedScheduler:0 2025-12-04T09:17:29.3914911Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::CyclicLR:0, line 1863 <- wrt source file 2025-12-04T09:17:29.3915997Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::CyclicLR:0 2025-12-04T09:17:29.3917181Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::CosineAnnealingWarmRestarts:0, line 2129 <- wrt source file 2025-12-04T09:17:29.3918481Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::CosineAnnealingWarmRestarts:0 2025-12-04T09:17:29.3919790Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::CosineAnnealingWarmRestarts.step:0, line 2211 <- wrt source file 2025-12-04T09:17:29.3921160Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::CosineAnnealingWarmRestarts.step:0 2025-12-04T09:17:29.3922501Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::CosineAnnealingWarmRestarts.step:1, line 2227 <- wrt source file 2025-12-04T09:17:29.3923868Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::CosineAnnealingWarmRestarts.step:1 2025-12-04T09:17:29.3925072Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::OneCycleLR:0, line 2367 <- wrt source file 2025-12-04T09:17:29.3926187Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::OneCycleLR:0 2025-12-04T09:17:29.3927395Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/optimizer.py::Optimizer.load_state_dict:0, line 900 <- wrt source file 2025-12-04T09:17:29.3928628Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/optimizer.py::Optimizer.load_state_dict:0 2025-12-04T09:17:29.3929816Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/swa_utils.py::AveragedModel:0, line 155 <- wrt source file 2025-12-04T09:17:29.3930983Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/swa_utils.py::AveragedModel:0 2025-12-04T09:17:29.3932057Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/swa_utils.py::AveragedModel:1, line 181 <- wrt source file 2025-12-04T09:17:29.3933244Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/swa_utils.py::AveragedModel:1 2025-12-04T09:17:29.3934301Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/swa_utils.py::update_bn:0, line 350 <- wrt source file 2025-12-04T09:17:29.3935365Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/swa_utils.py::update_bn:0 2025-12-04T09:17:29.3936381Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/swa_utils.py::SWALR:0, line 409 <- wrt source file 2025-12-04T09:17:29.3937356Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/swa_utils.py::SWALR:0 2025-12-04T09:17:29.3938184Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/package/glob_group.py::GlobGroup:0, line 22 <- wrt source file 2025-12-04T09:17:29.3939013Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/package/glob_group.py::GlobGroup:0 2025-12-04T09:17:29.3939928Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/profiler/profiler.py::_KinetoProfile.toggle_collection_dynamic:0, line 317 <- wrt source file 2025-12-04T09:17:29.3940951Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/profiler/profiler.py::_KinetoProfile.toggle_collection_dynamic:0 2025-12-04T09:17:29.3941849Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/profiler/profiler.py::profile:0, line 659 <- wrt source file 2025-12-04T09:17:29.3942645Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/profiler/profiler.py::profile:0 2025-12-04T09:17:29.3943498Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/sparse/semi_structured.py::to_sparse_semi_structured:0, line 342 <- wrt source file 2025-12-04T09:17:29.3944432Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/sparse/semi_structured.py::to_sparse_semi_structured:0 2025-12-04T09:17:29.3945316Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_comparison.py::assert_close:0, line 1477 <- wrt source file 2025-12-04T09:17:29.3970492Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_comparison.py::assert_close:0 2025-12-04T09:17:29.3971534Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_creation.py::make_tensor:0, line 114 <- wrt source file 2025-12-04T09:17:29.3972691Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_creation.py::make_tensor:0 2025-12-04T09:17:29.3973749Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py::parametrize:0, line 648 <- wrt source file 2025-12-04T09:17:29.3974888Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py::parametrize:0 2025-12-04T09:17:29.3976126Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py::reparametrize:0, line 769 <- wrt source file 2025-12-04T09:17:29.3977277Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py::reparametrize:0 2025-12-04T09:17:29.3978463Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py::decorateIf:0, line 858 <- wrt source file 2025-12-04T09:17:29.3979667Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py::decorateIf:0 2025-12-04T09:17:29.3980839Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py::random_symmetric_psd_matrix:0, line 4839 <- wrt source file 2025-12-04T09:17:29.3981878Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py::random_symmetric_psd_matrix:0 2025-12-04T09:17:29.3982859Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py::random_hermitian_psd_matrix:0, line 4853 <- wrt source file 2025-12-04T09:17:29.3983869Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py::random_hermitian_psd_matrix:0 2025-12-04T09:17:29.3984902Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py::random_hermitian_pd_matrix:0, line 4883 <- wrt source file 2025-12-04T09:17:29.3985908Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py::random_hermitian_pd_matrix:0 2025-12-04T09:17:29.3986838Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/logging_utils.py::logs_to_string:0, line 194 <- wrt source file 2025-12-04T09:17:29.3987767Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/logging_utils.py::logs_to_string:0 2025-12-04T09:17:29.3988719Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/logging_utils.py::multiple_logs_to_string:0, line 220 <- wrt source file 2025-12-04T09:17:29.3989703Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/logging_utils.py::multiple_logs_to_string:0 2025-12-04T09:17:29.3990742Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/_tensor/common_dtensor.py::skip_unless_torch_gpu:0, line 341 <- wrt source file 2025-12-04T09:17:29.3991857Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/_tensor/common_dtensor.py::skip_unless_torch_gpu:0 2025-12-04T09:17:29.3992958Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/optests/autograd_registration.py::autograd_registration_check:0, line 29 <- wrt source file 2025-12-04T09:17:29.4000339Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/optests/autograd_registration.py::autograd_registration_check:0 2025-12-04T09:17:29.4001575Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_cxx_pytree.py::register_pytree_node:0, line 159 <- wrt source file 2025-12-04T09:17:29.4002665Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_cxx_pytree.py::register_pytree_node:0 2025-12-04T09:17:29.4003703Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_cxx_pytree.py::tree_is_leaf:0, line 316 <- wrt source file 2025-12-04T09:17:29.4008599Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_cxx_pytree.py::tree_is_leaf:0 2025-12-04T09:17:29.4009625Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_cxx_pytree.py::tree_flatten:0, line 359 <- wrt source file 2025-12-04T09:17:29.4016230Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_cxx_pytree.py::tree_flatten:0 2025-12-04T09:17:29.4017163Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_cxx_pytree.py::tree_unflatten:0, line 396 <- wrt source file 2025-12-04T09:17:29.4021066Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_cxx_pytree.py::tree_unflatten:0 2025-12-04T09:17:29.4022139Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_cxx_pytree.py::tree_iter:0, line 429 <- wrt source file 2025-12-04T09:17:29.4027551Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_cxx_pytree.py::tree_iter:0 2025-12-04T09:17:29.4028567Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_cxx_pytree.py::tree_leaves:0, line 464 <- wrt source file 2025-12-04T09:17:29.4032323Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_cxx_pytree.py::tree_leaves:0 2025-12-04T09:17:29.4033168Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_cxx_pytree.py::tree_structure:0, line 499 <- wrt source file 2025-12-04T09:17:29.4037514Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_cxx_pytree.py::tree_structure:0 2025-12-04T09:17:29.4038343Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_cxx_pytree.py::tree_map:0, line 536 <- wrt source file 2025-12-04T09:17:29.4043629Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_cxx_pytree.py::tree_map:0 2025-12-04T09:17:29.4044475Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_cxx_pytree.py::broadcast_prefix:0, line 929 <- wrt source file 2025-12-04T09:17:29.4053174Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_cxx_pytree.py::broadcast_prefix:0 2025-12-04T09:17:29.4054255Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_pytree.py::register_dataclass:0, line 308 <- wrt source file 2025-12-04T09:17:29.4064280Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_pytree.py::register_dataclass:0 2025-12-04T09:17:29.4065334Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_pytree.py::register_constant:0, line 428 <- wrt source file 2025-12-04T09:17:29.4073407Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_pytree.py::register_constant:0 2025-12-04T09:17:29.4074419Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_pytree.py::tree_is_leaf:0, line 1058 <- wrt source file 2025-12-04T09:17:29.4079938Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_pytree.py::tree_is_leaf:0 2025-12-04T09:17:29.4080920Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_pytree.py::tree_map:0, line 1497 <- wrt source file 2025-12-04T09:17:29.4086360Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_pytree.py::tree_map:0 2025-12-04T09:17:29.4087709Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/backend_registration.py::rename_privateuse1_backend:0, line 71 <- wrt source file 2025-12-04T09:17:29.4089168Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/backend_registration.py::rename_privateuse1_backend:0 2025-12-04T09:17:29.4091028Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/backend_registration.py::generate_methods_for_privateuse1_backend:0, line 382 <- wrt source file 2025-12-04T09:17:29.4092474Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/backend_registration.py::generate_methods_for_privateuse1_backend:0 2025-12-04T09:17:29.4093597Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/backend_registration.py::_get_custom_mod_func:0, line 417 <- wrt source file 2025-12-04T09:17:29.4094605Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/backend_registration.py::_get_custom_mod_func:0 2025-12-04T09:17:29.4095517Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/checkpoint.py::checkpoint_sequential:0, line 561 <- wrt source file 2025-12-04T09:17:29.4096430Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/checkpoint.py::checkpoint_sequential:0 2025-12-04T09:17:29.4097305Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/checkpoint.py::set_checkpoint_early_stop:0, line 763 <- wrt source file 2025-12-04T09:17:29.4098209Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/checkpoint.py::set_checkpoint_early_stop:0 2025-12-04T09:17:29.4099180Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/checkpoint.py::SelectiveCheckpointContext:0, line 1257 <- wrt source file 2025-12-04T09:17:29.4100132Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/checkpoint.py::SelectiveCheckpointContext:0 2025-12-04T09:17:29.4101077Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/checkpoint.py::create_selective_checkpoint_contexts:0, line 1421 <- wrt source file 2025-12-04T09:17:29.4105338Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/checkpoint.py::create_selective_checkpoint_contexts:0 2025-12-04T09:17:29.4106611Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/cpp_extension.py::CppExtension:0, line 1247 <- wrt source file 2025-12-04T09:17:29.4107755Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/cpp_extension.py::CppExtension:0 2025-12-04T09:17:29.4108821Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/cpp_extension.py::CUDAExtension:0, line 1319 <- wrt source file 2025-12-04T09:17:29.4109857Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/cpp_extension.py::CUDAExtension:0 2025-12-04T09:17:29.4110698Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/cpp_extension.py::CUDAExtension:1, line 1397 <- wrt source file 2025-12-04T09:17:29.4111556Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/cpp_extension.py::CUDAExtension:1 2025-12-04T09:17:29.4112377Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/cpp_extension.py::SyclExtension:0, line 1509 <- wrt source file 2025-12-04T09:17:29.4113241Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/cpp_extension.py::SyclExtension:0 2025-12-04T09:17:29.4114036Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/cpp_extension.py::load:0, line 1759 <- wrt source file 2025-12-04T09:17:29.4114824Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/cpp_extension.py::load:0 2025-12-04T09:17:29.4115620Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/cpp_extension.py::load_inline:0, line 2032 <- wrt source file 2025-12-04T09:17:29.4116544Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/cpp_extension.py::load_inline:0 2025-12-04T09:17:29.4117323Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/dlpack.py::from_dlpack:0, line 93 <- wrt source file 2025-12-04T09:17:29.4126408Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/dlpack.py::from_dlpack:0 2025-12-04T09:17:29.4127396Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/throughput_benchmark.py::ThroughputBenchmark:0, line 78 <- wrt source file 2025-12-04T09:17:29.4128436Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/throughput_benchmark.py::ThroughputBenchmark:0 2025-12-04T09:17:29.4129403Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_sympy/functions.py::MinMaxBase._collapse_arguments:0, line 742 <- wrt source file 2025-12-04T09:17:29.4539115Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_sympy/functions.py::MinMaxBase._collapse_arguments:0 2025-12-04T09:17:29.4540315Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/dataset.py::IterableDataset:0, line 94 <- wrt source file 2025-12-04T09:17:29.4546513Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/dataset.py::IterableDataset:0 2025-12-04T09:17:29.4548111Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/dataset.py::StackDataset:0, line 218 <- wrt source file 2025-12-04T09:17:29.4549448Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/dataset.py::StackDataset:0 2025-12-04T09:17:29.4550546Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/dataset.py::random_split:0, line 438 <- wrt source file 2025-12-04T09:17:29.4551457Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/dataset.py::random_split:0 2025-12-04T09:17:29.4552354Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/distributed.py::DistributedSampler:0, line 55 <- wrt source file 2025-12-04T09:17:29.4553310Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/distributed.py::DistributedSampler:0 2025-12-04T09:17:29.4554175Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/sampler.py::Sampler:0, line 36 <- wrt source file 2025-12-04T09:17:29.4555020Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/sampler.py::Sampler:0 2025-12-04T09:17:29.4556048Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/sampler.py::WeightedRandomSampler:0, line 225 <- wrt source file 2025-12-04T09:17:29.4557009Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/sampler.py::WeightedRandomSampler:0 2025-12-04T09:17:29.4557889Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/sampler.py::BatchSampler:0, line 296 <- wrt source file 2025-12-04T09:17:29.4562507Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/sampler.py::BatchSampler:0 2025-12-04T09:17:29.4563422Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/_utils/collate.py::default_convert:0, line 39 <- wrt source file 2025-12-04T09:17:29.4565673Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/_utils/collate.py::default_convert:0 2025-12-04T09:17:29.4566678Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/_utils/collate.py::collate:0, line 137 <- wrt source file 2025-12-04T09:17:29.4570646Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/_utils/collate.py::collate:0 2025-12-04T09:17:29.4571731Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/_utils/collate.py::default_collate:0, line 367 <- wrt source file 2025-12-04T09:17:29.4576705Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/_utils/collate.py::default_collate:0 2025-12-04T09:17:29.4577956Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/datapipe.py::IterDataPipe:0, line 97 <- wrt source file 2025-12-04T09:17:29.4579415Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/datapipe.py::IterDataPipe:0 2025-12-04T09:17:29.4580756Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/datapipe.py::MapDataPipe:0, line 269 <- wrt source file 2025-12-04T09:17:29.4581738Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/datapipe.py::MapDataPipe:0 2025-12-04T09:17:29.4582857Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py::MapperIterDataPipe:0, line 53 <- wrt source file 2025-12-04T09:17:29.4583985Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py::MapperIterDataPipe:0 2025-12-04T09:17:29.4585002Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py::CollatorIterDataPipe:0, line 202 <- wrt source file 2025-12-04T09:17:29.4589682Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py::CollatorIterDataPipe:0 2025-12-04T09:17:29.4590858Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/combinatorics.py::ShufflerIterDataPipe:0, line 90 <- wrt source file 2025-12-04T09:17:29.4592267Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/combinatorics.py::ShufflerIterDataPipe:0 2025-12-04T09:17:29.4593569Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/combining.py::ConcaterIterDataPipe:0, line 38 <- wrt source file 2025-12-04T09:17:29.4618959Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/combining.py::ConcaterIterDataPipe:0 2025-12-04T09:17:29.4620651Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/combining.py::ForkerIterDataPipe:0, line 89 <- wrt source file 2025-12-04T09:17:29.4622026Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/combining.py::ForkerIterDataPipe:0 2025-12-04T09:17:29.4623466Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/combining.py::_ChildDataPipe:0, line 308 <- wrt source file 2025-12-04T09:17:29.4624960Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/combining.py::_ChildDataPipe:0 2025-12-04T09:17:29.4626498Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/combining.py::DemultiplexerIterDataPipe:0, line 397 <- wrt source file 2025-12-04T09:17:29.4628057Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/combining.py::DemultiplexerIterDataPipe:0 2025-12-04T09:17:29.4630566Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/combining.py::MultiplexerIterDataPipe:0, line 615 <- wrt source file 2025-12-04T09:17:29.4631856Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/combining.py::MultiplexerIterDataPipe:0 2025-12-04T09:17:29.4632888Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/combining.py::ZipperIterDataPipe:0, line 685 <- wrt source file 2025-12-04T09:17:29.4633976Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/combining.py::ZipperIterDataPipe:0 2025-12-04T09:17:29.4635128Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/filelister.py::FileListerIterDataPipe:0, line 29 <- wrt source file 2025-12-04T09:17:29.4636200Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/filelister.py::FileListerIterDataPipe:0 2025-12-04T09:17:29.4637437Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/fileopener.py::FileOpenerIterDataPipe:0, line 33 <- wrt source file 2025-12-04T09:17:29.4638683Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/fileopener.py::FileOpenerIterDataPipe:0 2025-12-04T09:17:29.4639939Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/grouping.py::BatcherIterDataPipe:0, line 41 <- wrt source file 2025-12-04T09:17:29.4641197Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/grouping.py::BatcherIterDataPipe:0 2025-12-04T09:17:29.4642223Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/grouping.py::UnBatcherIterDataPipe:0, line 102 <- wrt source file 2025-12-04T09:17:29.4643440Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/grouping.py::UnBatcherIterDataPipe:0 2025-12-04T09:17:29.4644591Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/grouping.py::GrouperIterDataPipe:0, line 169 <- wrt source file 2025-12-04T09:17:29.4645767Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/grouping.py::GrouperIterDataPipe:0 2025-12-04T09:17:29.4646921Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/selecting.py::FilterIterDataPipe:0, line 37 <- wrt source file 2025-12-04T09:17:29.4648108Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/selecting.py::FilterIterDataPipe:0 2025-12-04T09:17:29.4649164Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/streamreader.py::StreamReaderIterDataPipe:0, line 24 <- wrt source file 2025-12-04T09:17:29.4650557Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/streamreader.py::StreamReaderIterDataPipe:0 2025-12-04T09:17:29.4651651Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/utils.py::IterableWrapperIterDataPipe:0, line 29 <- wrt source file 2025-12-04T09:17:29.4652809Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/utils.py::IterableWrapperIterDataPipe:0 2025-12-04T09:17:29.4653834Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/map/callable.py::MapperMapDataPipe:0, line 36 <- wrt source file 2025-12-04T09:17:29.4654930Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/map/callable.py::MapperMapDataPipe:0 2025-12-04T09:17:29.4656146Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/map/combinatorics.py::ShufflerIterDataPipe:0, line 34 <- wrt source file 2025-12-04T09:17:29.4657221Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/map/combinatorics.py::ShufflerIterDataPipe:0 2025-12-04T09:17:29.4658301Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/map/combining.py::ConcaterMapDataPipe:0, line 29 <- wrt source file 2025-12-04T09:17:29.4659378Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/map/combining.py::ConcaterMapDataPipe:0 2025-12-04T09:17:29.4660383Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/map/combining.py::ZipperMapDataPipe:0, line 76 <- wrt source file 2025-12-04T09:17:29.4661402Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/map/combining.py::ZipperMapDataPipe:0 2025-12-04T09:17:29.4662739Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/map/grouping.py::BatcherMapDataPipe:0, line 29 <- wrt source file 2025-12-04T09:17:29.4663815Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/map/grouping.py::BatcherMapDataPipe:0 2025-12-04T09:17:29.4664925Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/map/utils.py::SequenceWrapperMapDataPipe:0, line 29 <- wrt source file 2025-12-04T09:17:29.4665996Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/map/utils.py::SequenceWrapperMapDataPipe:0 2025-12-04T09:17:29.4667010Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/utils/common.py::validate_input_col:0, line 37 <- wrt source file 2025-12-04T09:17:29.4668005Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/utils/common.py::validate_input_col:0 2025-12-04T09:17:29.4668990Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/utils/decoder.py::basichandlers:0, line 47 <- wrt source file 2025-12-04T09:17:29.4669971Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/utils/decoder.py::basichandlers:0 2025-12-04T09:17:29.4670905Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/hipify/hipify_python.py::find_closure_group:0, line 439 <- wrt source file 2025-12-04T09:17:30.0060468Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/hipify/hipify_python.py::find_closure_group:0 2025-12-04T09:17:30.0062678Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/hipify/hipify_python.py::replace_extern_shared:0, line 535 <- wrt source file 2025-12-04T09:17:30.0064698Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/hipify/hipify_python.py::replace_extern_shared:0 2025-12-04T09:17:30.0066631Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.__init__:0, line 217 <- wrt source file 2025-12-04T09:17:30.0068635Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.__init__:0 2025-12-04T09:17:30.0070666Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.add_hparams:0, line 322 <- wrt source file 2025-12-04T09:17:30.0072517Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.add_hparams:0 2025-12-04T09:17:30.0073871Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.add_scalar:0, line 370 <- wrt source file 2025-12-04T09:17:30.0075347Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.add_scalar:0 2025-12-04T09:17:30.0076912Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.add_scalars:0, line 402 <- wrt source file 2025-12-04T09:17:30.0078467Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.add_scalars:0 2025-12-04T09:17:30.0079874Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.add_tensor:0, line 450 <- wrt source file 2025-12-04T09:17:30.0081329Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.add_tensor:0 2025-12-04T09:17:30.0082454Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.add_histogram:0, line 489 <- wrt source file 2025-12-04T09:17:30.0083665Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.add_histogram:0 2025-12-04T09:17:30.0085021Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.add_histogram_raw:0, line 542 <- wrt source file 2025-12-04T09:17:30.0086242Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.add_histogram_raw:0 2025-12-04T09:17:30.0100106Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.add_image:0, line 608 <- wrt source file 2025-12-04T09:17:30.0101447Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.add_image:0 2025-12-04T09:17:30.0103071Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.add_images:0, line 657 <- wrt source file 2025-12-04T09:17:30.0104219Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.add_images:0 2025-12-04T09:17:30.0105183Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.add_text:0, line 820 <- wrt source file 2025-12-04T09:17:30.0106499Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.add_text:0 2025-12-04T09:17:30.0107527Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.add_embedding:0, line 887 <- wrt source file 2025-12-04T09:17:30.0109060Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.add_embedding:0 2025-12-04T09:17:30.0110635Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.add_pr_curve:0, line 998 <- wrt source file 2025-12-04T09:17:30.0112027Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.add_pr_curve:0 2025-12-04T09:17:30.0113657Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.add_custom_scalars_multilinechart:0, line 1072 <- wrt source file 2025-12-04T09:17:30.0115188Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.add_custom_scalars_multilinechart:0 2025-12-04T09:17:30.0116313Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.add_custom_scalars_marginchart:0, line 1093 <- wrt source file 2025-12-04T09:17:30.0117504Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.add_custom_scalars_marginchart:0 2025-12-04T09:17:30.0118614Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.add_custom_scalars:0, line 1118 <- wrt source file 2025-12-04T09:17:30.0119663Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.add_custom_scalars:0 2025-12-04T09:17:30.0120656Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.add_mesh:0, line 1164 <- wrt source file 2025-12-04T09:17:30.0121767Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.add_mesh:0 2025-12-04T09:17:30.0122289Z ============ 2025-12-04T09:17:30.0122479Z Finished doctests 2025-12-04T09:17:30.0122639Z 378 / 894 passed 2025-12-04T09:17:30.0122808Z  2025-12-04T09:17:30.0123023Z === Found 17 parse-time warnings === 2025-12-04T09:17:30.0123374Z --- Parse Warning: 1 / 17 --- 2025-12-04T09:17:30.0124102Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/xdoctest/core.py:416: UserWarning: Cannot scrape callname=Library.fallback in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/library.py line=368. 2025-12-04T09:17:30.0124911Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T09:17:30.0125322Z Registers the function implementation as the fallback for the given key. 2025-12-04T09:17:30.0125629Z 2025-12-04T09:17:30.0125852Z This function only works for a library with global namespace ("_"). 2025-12-04T09:17:30.0126130Z 2025-12-04T09:17:30.0126263Z Args: 2025-12-04T09:17:30.0126544Z fn: function used as fallback for the given dispatch key or :func:`~fallthrough_kernel` 2025-12-04T09:17:30.0126895Z to register a fallthrough. 2025-12-04T09:17:30.0127271Z dispatch_key: dispatch key that the input function should be registered for. By default, it uses 2025-12-04T09:17:30.0127681Z the dispatch key that the library was created with. 2025-12-04T09:17:30.0128126Z with_keyset: flag controlling if the current dispatcher call keyset should be passed as the first argument 2025-12-04T09:17:30.0128673Z to :attr:`fn` when calling. This should be used to create the appropriate keyset for redispatch calls. 2025-12-04T09:17:30.0129024Z 2025-12-04T09:17:30.0129161Z Example:: 2025-12-04T09:17:30.0129322Z 2025-12-04T09:17:30.0129475Z >>> my_lib = Library("_", "IMPL") 2025-12-04T09:17:30.0129717Z >>> def fallback_kernel(op, *args, **kwargs): 2025-12-04T09:17:30.0129973Z >>> # Handle all autocast ops generically 2025-12-04T09:17:30.0130194Z >>> # ... 2025-12-04T09:17:30.0130412Z >>> my_lib.fallback(fallback_kernel, "Autocast") 2025-12-04T09:17:30.0130640Z 2025-12-04T09:17:30.0131155Z Original Error: IndentationError('expected an indented block after function definition on line 2', ('', 5, 1, 'my_lib.fallback(fallback_kernel, "Autocast")\n', 5, 7)) 2025-12-04T09:17:30.0131697Z 2025-12-04T09:17:30.0131857Z my_lib.fallback(fallback_kernel, "Autocast") 2025-12-04T09:17:30.0132128Z ^ 2025-12-04T09:17:30.0132270Z warnings.warn(msg) 2025-12-04T09:17:30.0132595Z 2025-12-04T09:17:30.0132802Z --- Parse Warning: 2 / 17 --- 2025-12-04T09:17:30.0133518Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/xdoctest/core.py:416: UserWarning: Cannot scrape callname=register_fake in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/library.py line=958. 2025-12-04T09:17:30.0134312Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T09:17:30.0134767Z Register a FakeTensor implementation ("fake impl") for this operator. 2025-12-04T09:17:30.0135094Z 2025-12-04T09:17:30.0135294Z Also sometimes known as a "meta kernel", "abstract impl". 2025-12-04T09:17:30.0135540Z 2025-12-04T09:17:30.0135778Z An "FakeTensor implementation" specifies the behavior of this operator on 2025-12-04T09:17:30.0136185Z Tensors that carry no data ("FakeTensor"). Given some input Tensors with 2025-12-04T09:17:30.0136569Z certain properties (sizes/strides/storage_offset/device), it specifies 2025-12-04T09:17:30.0136897Z what the properties of the output Tensors are. 2025-12-04T09:17:30.0137121Z 2025-12-04T09:17:30.0137353Z The FakeTensor implementation has the same signature as the operator. 2025-12-04T09:17:30.0137718Z It is run for both FakeTensors and meta tensors. To write a FakeTensor 2025-12-04T09:17:30.0138084Z implementation, assume that all Tensor inputs to the operator are 2025-12-04T09:17:30.0138498Z regular CPU/CUDA/Meta tensors, but they do not have storage, and 2025-12-04T09:17:30.0138856Z you are trying to return regular CPU/CUDA/Meta tensor(s) as output. 2025-12-04T09:17:30.0139243Z The FakeTensor implementation must consist of only PyTorch operations 2025-12-04T09:17:30.0139603Z (and may not directly access the storage or data of any input or 2025-12-04T09:17:30.0139883Z intermediate Tensors). 2025-12-04T09:17:30.0140058Z 2025-12-04T09:17:30.0140243Z This API may be used as a decorator (see examples). 2025-12-04T09:17:30.0140476Z 2025-12-04T09:17:30.0140642Z For a detailed guide on custom ops, please see 2025-12-04T09:17:30.0140979Z https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html 2025-12-04T09:17:30.0141276Z 2025-12-04T09:17:30.0141402Z Args: 2025-12-04T09:17:30.0141646Z op_name: Operator name (along with the overload) or OpOverload object. 2025-12-04T09:17:30.0141957Z func: Fake tensor implementation. 2025-12-04T09:17:30.0142248Z lib (Optional[Library]): Library to register the fake tensor to. 2025-12-04T09:17:30.0142578Z allow_override: Flag controlling if we want to override an 2025-12-04T09:17:30.0142904Z existing registered fake impl. This is by default off, 2025-12-04T09:17:30.0143221Z and will error you're trying to register a fake impl to 2025-12-04T09:17:30.0143554Z an operator that already has a fake impl. This also only 2025-12-04T09:17:30.0143866Z applies if the custom operator was not created via 2025-12-04T09:17:30.0144188Z torch.library.custom_op, as overriding and existing fake 2025-12-04T09:17:30.0144481Z impl is already allowed. 2025-12-04T09:17:30.0144680Z 2025-12-04T09:17:30.0144815Z Examples: 2025-12-04T09:17:30.0144979Z >>> import torch 2025-12-04T09:17:30.0145170Z >>> import numpy as np 2025-12-04T09:17:30.0145383Z >>> from torch import Tensor 2025-12-04T09:17:30.0145582Z >>> 2025-12-04T09:17:30.0145795Z >>> # Example 1: an operator without data-dependent output shape 2025-12-04T09:17:30.0146160Z >>> @torch.library.custom_op("mylib::custom_linear", mutates_args=()) 2025-12-04T09:17:30.0146582Z >>> def custom_linear(x: Tensor, weight: Tensor, bias: Tensor) -> Tensor: 2025-12-04T09:17:30.0146935Z >>> raise NotImplementedError("Implementation goes here") 2025-12-04T09:17:30.0147187Z >>> 2025-12-04T09:17:30.0147398Z >>> @torch.library.register_fake("mylib::custom_linear") 2025-12-04T09:17:30.0147658Z >>> def _(x, weight, bias): 2025-12-04T09:17:30.0147860Z >>> assert x.dim() == 2 2025-12-04T09:17:30.0148076Z >>> assert weight.dim() == 2 2025-12-04T09:17:30.0148341Z >>> assert bias.dim() == 1 2025-12-04T09:17:30.0148614Z >>> assert x.shape[1] == weight.shape[1] 2025-12-04T09:17:30.0148873Z >>> assert weight.shape[0] == bias.shape[0] 2025-12-04T09:17:30.0149121Z >>> assert x.device == weight.device 2025-12-04T09:17:30.0149335Z >>> 2025-12-04T09:17:30.0149501Z >>> return (x @ weight.t()) + bias 2025-12-04T09:17:30.0149715Z >>> 2025-12-04T09:17:30.0149923Z >>> with torch._subclasses.fake_tensor.FakeTensorMode(): 2025-12-04T09:17:30.0150186Z >>> x = torch.randn(2, 3) 2025-12-04T09:17:30.0150403Z >>> w = torch.randn(3, 3) 2025-12-04T09:17:30.0150613Z >>> b = torch.randn(3) 2025-12-04T09:17:30.0150842Z >>> y = torch.ops.mylib.custom_linear(x, w, b) 2025-12-04T09:17:30.0151078Z >>> 2025-12-04T09:17:30.0151240Z >>> assert y.shape == (2, 3) 2025-12-04T09:17:30.0151433Z >>> 2025-12-04T09:17:30.0151681Z >>> # Example 2: an operator with data-dependent output shape 2025-12-04T09:17:30.0152036Z >>> @torch.library.custom_op("mylib::custom_nonzero", mutates_args=()) 2025-12-04T09:17:30.0152355Z >>> def custom_nonzero(x: Tensor) -> Tensor: 2025-12-04T09:17:30.0152600Z >>> x_np = x.numpy(force=True) 2025-12-04T09:17:30.0152840Z >>> res = np.stack(np.nonzero(x_np), axis=1) 2025-12-04T09:17:30.0153103Z >>> return torch.tensor(res, device=x.device) 2025-12-04T09:17:30.0153318Z >>> 2025-12-04T09:17:30.0153527Z >>> @torch.library.register_fake("mylib::custom_nonzero") 2025-12-04T09:17:30.0153786Z >>> def _(x): 2025-12-04T09:17:30.0153996Z >>> # Number of nonzero-elements is data-dependent. 2025-12-04T09:17:30.0154283Z >>> # Since we cannot peek at the data in an fake impl, 2025-12-04T09:17:30.0154585Z >>> # we use the ctx object to construct a new symint that 2025-12-04T09:17:30.0154867Z >>> # represents the data-dependent size. 2025-12-04T09:17:30.0155114Z >>> ctx = torch.library.get_ctx() 2025-12-04T09:17:30.0155352Z >>> nnz = ctx.new_dynamic_size() 2025-12-04T09:17:30.0155754Z >>> shape = [nnz, x.dim()] 2025-12-04T09:17:30.0156095Z >>> result = x.new_empty(shape, dtype=torch.int64) 2025-12-04T09:17:30.0156352Z >>> return result 2025-12-04T09:17:30.0156536Z >>> 2025-12-04T09:17:30.0156742Z >>> from torch.fx.experimental.proxy_tensor import make_fx 2025-12-04T09:17:30.0156997Z >>> 2025-12-04T09:17:30.0157163Z >>> x = torch.tensor([0, 1, 2, 3, 4, 0]) 2025-12-04T09:17:30.0157480Z >>> trace = make_fx(torch.ops.mylib.custom_nonzero, tracing_mode="symbolic")(x) 2025-12-04T09:17:30.0157798Z >>> trace.print_readable() 2025-12-04T09:17:30.0157992Z >>> 2025-12-04T09:17:30.0158230Z >>> assert torch.allclose(trace(x), torch.ops.mylib.custom_nonzero(x)) 2025-12-04T09:17:30.0158512Z 2025-12-04T09:17:30.0158646Z 2025-12-04T09:17:30.0159081Z Original Error: IndentationError('expected an indented block after function definition on line 37', ('', 38, 1, '_._ = None\n', 38, 2)) 2025-12-04T09:17:30.0159554Z 2025-12-04T09:17:30.0159686Z _._ = None 2025-12-04T09:17:30.0159913Z ^ 2025-12-04T09:17:30.0160057Z warnings.warn(msg) 2025-12-04T09:17:30.0160221Z 2025-12-04T09:17:30.0160459Z --- Parse Warning: 3 / 17 --- 2025-12-04T09:17:30.0161183Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/xdoctest/core.py:416: UserWarning: Cannot scrape callname=get_kernel in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/library.py line=1530. 2025-12-04T09:17:30.0162234Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T09:17:30.0162718Z Returns the computed kernel for a given operator and dispatch key. 2025-12-04T09:17:30.0163055Z 2025-12-04T09:17:30.0163273Z This function retrieves the kernel that would be executed for a given 2025-12-04T09:17:30.0163673Z operator and dispatch key combination. The returned SafeKernelFunction 2025-12-04T09:17:30.0164037Z can be used to call the kernel in a boxed fashion. The intended use 2025-12-04T09:17:30.0164385Z case for this function is to retrieve the original kernel for a given 2025-12-04T09:17:30.0164743Z dispatch key and then register another kernel to the same dispatch key 2025-12-04T09:17:30.0165080Z that calls into the original kernel for certain cases. 2025-12-04T09:17:30.0165320Z 2025-12-04T09:17:30.0165444Z Args: 2025-12-04T09:17:30.0165668Z op: Operator name (along with the overload) or OpOverload object 2025-12-04T09:17:30.0166043Z Can be a string (e.g., "aten::add.Tensor"), an OpOverload, or a CustomOpDef. 2025-12-04T09:17:30.0166520Z dispatch_key (str | torch.DispatchKey): The dispatch key to get the kernel for. 2025-12-04T09:17:30.0166916Z Can be a string (e.g., "CPU", "CUDA") or a DispatchKey enum value. 2025-12-04T09:17:30.0167177Z 2025-12-04T09:17:30.0167318Z Returns: 2025-12-04T09:17:30.0167570Z torch._C._SafeKernelFunction: A safe kernel function that can be used to 2025-12-04T09:17:30.0167880Z call the kernel. 2025-12-04T09:17:30.0168064Z 2025-12-04T09:17:30.0168188Z Raises: 2025-12-04T09:17:30.0168378Z RuntimeError: If the operator does not exist. 2025-12-04T09:17:30.0168610Z 2025-12-04T09:17:30.0168743Z Example: 2025-12-04T09:17:30.0168915Z >>> # Get the CPU kernel for torch.add 2025-12-04T09:17:30.0169204Z >>> kernel = torch.library.get_kernel("aten::add.Tensor", "CPU") 2025-12-04T09:17:30.0169471Z >>> 2025-12-04T09:17:30.0169635Z >>> # You can also use DispatchKey enum 2025-12-04T09:17:30.0169970Z >>> kernel = torch.library.get_kernel("aten::add.Tensor", torch.DispatchKey.CPU) 2025-12-04T09:17:30.0170284Z >>> 2025-12-04T09:17:30.0170443Z >>> # Or use an OpOverload directly 2025-12-04T09:17:30.0170747Z >>> kernel = torch.library.get_kernel(torch.ops.aten.add.Tensor, "CPU") 2025-12-04T09:17:30.0171031Z >>> 2025-12-04T09:17:30.0171272Z >>> # Example: Using get_kernel in a custom op with conditional dispatch 2025-12-04T09:17:30.0171577Z >>> # Get the original kernel for torch.sin 2025-12-04T09:17:30.0171881Z >>> original_sin_kernel = torch.library.get_kernel("aten::sin", "CPU") 2025-12-04T09:17:30.0172154Z >>> 2025-12-04T09:17:30.0172512Z >>> # If input has negative values, use original sin, otherwise return zeros 2025-12-04T09:17:30.0172844Z >>> def conditional_sin_impl(dispatch_keys, x): 2025-12-04T09:17:30.0173083Z >>> if (x < 0).any(): 2025-12-04T09:17:30.0173341Z >>> return original_sin_kernel.call_boxed(dispatch_keys, x) 2025-12-04T09:17:30.0173606Z >>> else: 2025-12-04T09:17:30.0173793Z >>> return torch.zeros_like(x) 2025-12-04T09:17:30.0174006Z >>> 2025-12-04T09:17:30.0174180Z >>> lib = torch.library.Library("aten", "IMPL") 2025-12-04T09:17:30.0174623Z >>> # with_keyset=True so the first argument to the impl is the current DispatchKeySet 2025-12-04T09:17:30.0175002Z >>> which needs to be the first argument to ``kernel.call_boxed`` 2025-12-04T09:17:30.0175330Z >>> lib.impl("sin", conditional_sin_impl, "CPU", with_keyset=True) 2025-12-04T09:17:30.0175588Z >>> 2025-12-04T09:17:30.0175751Z >>> # Test the conditional behavior 2025-12-04T09:17:30.0176003Z >>> x_positive = torch.tensor([1.0, 2.0]) 2025-12-04T09:17:30.0176236Z >>> x_mixed = torch.tensor([-1.0, 2.0]) 2025-12-04T09:17:30.0176505Z >>> torch.sin(x_positive) 2025-12-04T09:17:30.0176755Z tensor([0., 0.]) 2025-12-04T09:17:30.0176938Z >>> torch.sin(x_mixed) 2025-12-04T09:17:30.0177135Z tensor([-0.8415, 0.9093]) 2025-12-04T09:17:30.0177323Z 2025-12-04T09:17:30.0177725Z Original Error: SyntaxError('invalid syntax', ('', 23, 7, 'which needs to be the first argument to ``kernel.call_boxed``\n', 23, 12)) 2025-12-04T09:17:30.0178458Z 2025-12-04T09:17:30.0178671Z which needs to be the first argument to ``kernel.call_boxed`` 2025-12-04T09:17:30.0178923Z ^ 2025-12-04T09:17:30.0179073Z warnings.warn(msg) 2025-12-04T09:17:30.0179242Z 2025-12-04T09:17:30.0179468Z --- Parse Warning: 4 / 17 --- 2025-12-04T09:17:30.0180220Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/xdoctest/core.py:416: UserWarning: Cannot scrape callname=is_available in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/accelerator/__init__.py line=70. 2025-12-04T09:17:30.0181135Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T09:17:30.0181782Z Check if the current accelerator is available at runtime: it was build, all the 2025-12-04T09:17:30.0182211Z required drivers are available and at least one device is visible. 2025-12-04T09:17:30.0182543Z See :ref:`accelerator` for details. 2025-12-04T09:17:30.0182775Z 2025-12-04T09:17:30.0182910Z Returns: 2025-12-04T09:17:30.0183185Z bool: A boolean indicating if there is an available :ref:`accelerator`. 2025-12-04T09:17:30.0183504Z 2025-12-04T09:17:30.0183746Z .. note:: This API delegates to the device-specific version of `is_available`. 2025-12-04T09:17:30.0184158Z On CUDA, when the environment variable ``PYTORCH_NVML_BASED_CUDA_CHECK=1`` is set, 2025-12-04T09:17:30.0184580Z this function will NOT poison fork. Otherwise, it will. For more details, see 2025-12-04T09:17:30.0184930Z :ref:`multiprocessing-poison-fork-note`. 2025-12-04T09:17:30.0185157Z 2025-12-04T09:17:30.0185286Z Example:: 2025-12-04T09:17:30.0185434Z 2025-12-04T09:17:30.0185684Z >>> assert torch.accelerator.is_available() "No available accelerators detected." 2025-12-04T09:17:30.0185992Z 2025-12-04T09:17:30.0186449Z Original Error: SyntaxError('invalid syntax', ('', 1, 41, 'assert torch.accelerator.is_available() "No available accelerators detected."\n', 1, 78)) 2025-12-04T09:17:30.0186948Z 2025-12-04T09:17:30.0187189Z assert torch.accelerator.is_available() "No available accelerators detected." 2025-12-04T09:17:30.0187510Z ^ 2025-12-04T09:17:30.0187712Z warnings.warn(msg) 2025-12-04T09:17:30.0187878Z 2025-12-04T09:17:30.0188102Z --- Parse Warning: 5 / 17 --- 2025-12-04T09:17:30.0188856Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/xdoctest/core.py:416: UserWarning: Cannot scrape callname=synchronize in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/accelerator/__init__.py line=239. 2025-12-04T09:17:30.0189685Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T09:17:30.0190065Z Wait for all kernels in all streams on the given device to complete. 2025-12-04T09:17:30.0190393Z 2025-12-04T09:17:30.0190520Z Args: 2025-12-04T09:17:30.0190814Z device (:class:`torch.device`, str, int, optional): device for which to synchronize. It must match 2025-12-04T09:17:30.0191264Z the current :ref:`accelerator` device type. If not given, 2025-12-04T09:17:30.0191628Z use :func:`torch.accelerator.current_device_index` by default. 2025-12-04T09:17:30.0191893Z 2025-12-04T09:17:30.0192168Z .. note:: This function is a no-op if the current :ref:`accelerator` is not initialized. 2025-12-04T09:17:30.0192546Z 2025-12-04T09:17:30.0192717Z Example:: 2025-12-04T09:17:30.0192862Z 2025-12-04T09:17:30.0193034Z >>> # xdoctest: +REQUIRES(env:TORCH_DOCTEST_CUDA) 2025-12-04T09:17:30.0193396Z >>> assert torch.accelerator.is_available() "No available accelerators detected." 2025-12-04T09:17:30.0193753Z >>> start_event = torch.Event(enable_timing=True) 2025-12-04T09:17:30.0194015Z >>> end_event = torch.Event(enable_timing=True) 2025-12-04T09:17:30.0194253Z >>> start_event.record() 2025-12-04T09:17:30.0194555Z >>> tensor = torch.randn(100, device=torch.accelerator.current_accelerator()) 2025-12-04T09:17:30.0194862Z >>> sum = torch.sum(tensor) 2025-12-04T09:17:30.0195072Z >>> end_event.record() 2025-12-04T09:17:30.0195303Z >>> torch.accelerator.synchronize() 2025-12-04T09:17:30.0195579Z >>> elapsed_time_ms = start_event.elapsed_time(end_event) 2025-12-04T09:17:30.0195832Z 2025-12-04T09:17:30.0196339Z Original Error: SyntaxError('invalid syntax', ('', 2, 41, 'assert torch.accelerator.is_available() "No available accelerators detected."\n', 2, 78)) 2025-12-04T09:17:30.0196857Z 2025-12-04T09:17:30.0197103Z assert torch.accelerator.is_available() "No available accelerators detected." 2025-12-04T09:17:30.0197427Z ^ 2025-12-04T09:17:30.0197635Z warnings.warn(msg) 2025-12-04T09:17:30.0197796Z 2025-12-04T09:17:30.0198013Z --- Parse Warning: 6 / 17 --- 2025-12-04T09:17:30.0198731Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/xdoctest/core.py:416: UserWarning: Cannot scrape callname=cudart in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/__init__.py line=448. 2025-12-04T09:17:30.0199519Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T09:17:30.0199844Z Retrieves the CUDA runtime API module. 2025-12-04T09:17:30.0200048Z 2025-12-04T09:17:30.0200175Z 2025-12-04T09:17:30.0200416Z This function initializes the CUDA runtime environment if it is not already 2025-12-04T09:17:30.0200822Z initialized and returns the CUDA runtime API module (_cudart). The CUDA 2025-12-04T09:17:30.0201212Z runtime API module provides access to various CUDA runtime functions. 2025-12-04T09:17:30.0201502Z 2025-12-04T09:17:30.0201629Z Args: 2025-12-04T09:17:30.0201776Z ``None`` 2025-12-04T09:17:30.0201930Z 2025-12-04T09:17:30.0202056Z Returns: 2025-12-04T09:17:30.0202251Z module: The CUDA runtime API module (_cudart). 2025-12-04T09:17:30.0202477Z 2025-12-04T09:17:30.0202600Z Raises: 2025-12-04T09:17:30.0202850Z RuntimeError: If CUDA cannot be re-initialized in a forked subprocess. 2025-12-04T09:17:30.0203328Z AssertionError: If PyTorch is not compiled with CUDA support or if libcudart functions are unavailable. 2025-12-04T09:17:30.0203697Z 2025-12-04T09:17:30.0203872Z Example of CUDA operations with profiling: 2025-12-04T09:17:30.0204104Z >>> import torch 2025-12-04T09:17:30.0204310Z >>> from torch.cuda import cudart, check_error 2025-12-04T09:17:30.0204539Z >>> import os 2025-12-04T09:17:30.0204705Z >>> 2025-12-04T09:17:30.0204868Z >>> os.environ["CUDA_PROFILE"] = "1" 2025-12-04T09:17:30.0205140Z >>> 2025-12-04T09:17:30.0205325Z >>> def perform_cuda_operations_with_streams(): 2025-12-04T09:17:30.0205573Z >>> stream = torch.cuda.Stream() 2025-12-04T09:17:30.0205805Z >>> with torch.cuda.stream(stream): 2025-12-04T09:17:30.0206055Z >>> x = torch.randn(100, 100, device='cuda') 2025-12-04T09:17:30.0206305Z >>> y = torch.randn(100, 100, device='cuda') 2025-12-04T09:17:30.0206533Z >>> z = torch.mul(x, y) 2025-12-04T09:17:30.0206781Z >>> return z 2025-12-04T09:17:30.0206991Z >>> 2025-12-04T09:17:30.0207154Z >>> torch.cuda.synchronize() 2025-12-04T09:17:30.0207394Z >>> print("====== Start nsys profiling ======") 2025-12-04T09:17:30.0207655Z >>> check_error(cudart().cudaProfilerStart()) 2025-12-04T09:17:30.0207927Z >>> with torch.autograd.profiler.emit_nvtx(): 2025-12-04T09:17:30.0208202Z >>> result = perform_cuda_operations_with_streams() 2025-12-04T09:17:30.0208472Z >>> print("CUDA operations completed.") 2025-12-04T09:17:30.0208749Z >>> check_error(torch.cuda.cudart().cudaProfilerStop()) 2025-12-04T09:17:30.0209025Z >>> print("====== End nsys profiling ======") 2025-12-04T09:17:30.0209244Z 2025-12-04T09:17:30.0209456Z To run this example and save the profiling information, execute: 2025-12-04T09:17:30.0209918Z >>> $ nvprof --profile-from-start off --csv --print-summary -o trace_name.prof -f -- python cudart_test.py 2025-12-04T09:17:30.0210296Z 2025-12-04T09:17:30.0210598Z This command profiles the CUDA operations in the provided script and saves 2025-12-04T09:17:30.0210979Z the profiling information to a file named `trace_name.prof`. 2025-12-04T09:17:30.0211338Z The `--profile-from-start off` option ensures that profiling starts only 2025-12-04T09:17:30.0211675Z after the `cudaProfilerStart` call in the script. 2025-12-04T09:17:30.0211998Z The `--csv` and `--print-summary` options format the profiling output as a 2025-12-04T09:17:30.0212312Z CSV file and print a summary, respectively. 2025-12-04T09:17:30.0212715Z The `-o` option specifies the output file name, and the `-f` option forces the 2025-12-04T09:17:30.0213059Z overwrite of the output file if it already exists. 2025-12-04T09:17:30.0213289Z 2025-12-04T09:17:30.0213805Z Original Error: SyntaxError('invalid syntax', ('', 1, 1, '$ nvprof --profile-from-start off --csv --print-summary -o trace_name.prof -f -- python cudart_test.py\n', 1, 2)) 2025-12-04T09:17:30.0214385Z 2025-12-04T09:17:30.0214698Z $ nvprof --profile-from-start off --csv --print-summary -o trace_name.prof -f -- python cudart_test.py 2025-12-04T09:17:30.0215062Z ^ 2025-12-04T09:17:30.0215208Z warnings.warn(msg) 2025-12-04T09:17:30.0215377Z 2025-12-04T09:17:30.0215579Z --- Parse Warning: 7 / 17 --- 2025-12-04T09:17:30.0216291Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/xdoctest/core.py:416: UserWarning: Cannot scrape callname=vmap in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/apis.py line=39. 2025-12-04T09:17:30.0217079Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T09:17:30.0217375Z 2025-12-04T09:17:30.0217591Z vmap is the vectorizing map; ``vmap(func)`` returns a new function that 2025-12-04T09:17:30.0217950Z maps ``func`` over some dimension of the inputs. Semantically, vmap 2025-12-04T09:17:30.0218304Z pushes the map into PyTorch operations called by ``func``, effectively 2025-12-04T09:17:30.0218598Z vectorizing those operations. 2025-12-04T09:17:30.0218782Z 2025-12-04T09:17:30.0218993Z vmap is useful for handling batch dimensions: one can write a function 2025-12-04T09:17:30.0219357Z ``func`` that runs on examples and then lift it to a function that can 2025-12-04T09:17:30.0219747Z take batches of examples with ``vmap(func)``. vmap can also be used to 2025-12-04T09:17:30.0220075Z compute batched gradients when composed with autograd. 2025-12-04T09:17:30.0220315Z 2025-12-04T09:17:30.0220444Z .. note:: 2025-12-04T09:17:30.0220659Z :func:`torch.vmap` is aliased to :func:`torch.func.vmap` for 2025-12-04T09:17:30.0220959Z convenience. Use whichever one you'd like. 2025-12-04T09:17:30.0221174Z 2025-12-04T09:17:30.0221299Z Args: 2025-12-04T09:17:30.0221560Z func (function): A Python function that takes one or more arguments. 2025-12-04T09:17:30.0221896Z Must return one or more Tensors. 2025-12-04T09:17:30.0222171Z in_dims (int or nested structure): Specifies which dimension of the 2025-12-04T09:17:30.0222494Z inputs should be mapped over. ``in_dims`` should have a 2025-12-04T09:17:30.0222818Z structure like the inputs. If the ``in_dim`` for a particular 2025-12-04T09:17:30.0223151Z input is None, then that indicates there is no map dimension. 2025-12-04T09:17:30.0223419Z Default: 0. 2025-12-04T09:17:30.0223658Z out_dims (int or Tuple[int]): Specifies where the mapped dimension 2025-12-04T09:17:30.0224001Z should appear in the outputs. If ``out_dims`` is a Tuple, then 2025-12-04T09:17:30.0224309Z it should have one element per output. Default: 0. 2025-12-04T09:17:30.0224619Z randomness (str): Specifies whether the randomness in this 2025-12-04T09:17:30.0224982Z vmap should be the same or different across batches. If 'different', 2025-12-04T09:17:30.0225379Z the randomness for each batch will be different. If 'same', the 2025-12-04T09:17:30.0225735Z randomness will be the same across batches. If 'error', any calls to 2025-12-04T09:17:30.0226101Z random functions will error. Default: 'error'. WARNING: this flag 2025-12-04T09:17:30.0226467Z only applies to random PyTorch operations and does not apply to 2025-12-04T09:17:30.0226777Z Python's random module or numpy randomness. 2025-12-04T09:17:30.0227103Z chunk_size (None or int): If None (default), apply a single vmap over inputs. 2025-12-04T09:17:30.0227486Z If not None, then compute the vmap :attr:`chunk_size` samples at a time. 2025-12-04T09:17:30.0227882Z Note that :attr:`chunk_size=1` is equivalent to computing the vmap with a for-loop. 2025-12-04T09:17:30.0228315Z If you run into memory issues computing the vmap, please try a non-None chunk_size. 2025-12-04T09:17:30.0228628Z 2025-12-04T09:17:30.0228761Z Returns: 2025-12-04T09:17:30.0228973Z Returns a new "batched" function. It takes the same inputs as 2025-12-04T09:17:30.0229303Z ``func``, except each input has an extra dimension at the index 2025-12-04T09:17:30.0229645Z specified by ``in_dims``. It takes returns the same outputs as 2025-12-04T09:17:30.0229970Z ``func``, except each output has an extra dimension at the index 2025-12-04T09:17:30.0230244Z specified by ``out_dims``. 2025-12-04T09:17:30.0230432Z 2025-12-04T09:17:30.0230558Z .. warning: 2025-12-04T09:17:30.0230786Z :func:`vmap` works best with functional-style code. Please do not 2025-12-04T09:17:30.0231124Z perform any side-effects in ``func``, with the exception of 2025-12-04T09:17:30.0231484Z in-place PyTorch operations. Examples of side-effects include mutating 2025-12-04T09:17:30.0231875Z Python data structures and assigning values to variables not captured 2025-12-04T09:17:30.0232168Z in ``func``. 2025-12-04T09:17:30.0232327Z 2025-12-04T09:17:30.0232558Z One example of using :func:`vmap` is to compute batched dot products. PyTorch 2025-12-04T09:17:30.0232940Z doesn't provide a batched ``torch.dot`` API; instead of unsuccessfully 2025-12-04T09:17:30.0233309Z rummaging through docs, use :func:`vmap` to construct a new function. 2025-12-04T09:17:30.0233638Z 2025-12-04T09:17:30.0233788Z >>> torch.dot # [D], [D] -> [] 2025-12-04T09:17:30.0234066Z >>> batched_dot = torch.func.vmap(torch.dot) # [N, D], [N, D] -> [N] 2025-12-04T09:17:30.0234370Z >>> x, y = torch.randn(2, 5), torch.randn(2, 5) 2025-12-04T09:17:30.0234594Z >>> batched_dot(x, y) 2025-12-04T09:17:30.0234773Z 2025-12-04T09:17:30.0235001Z :func:`vmap` can be helpful in hiding batch dimensions, leading to a simpler 2025-12-04T09:17:30.0235294Z model authoring experience. 2025-12-04T09:17:30.0235530Z 2025-12-04T09:17:30.0235733Z >>> batch_size, feature_size = 3, 5 2025-12-04T09:17:30.0235996Z >>> weights = torch.randn(feature_size, requires_grad=True) 2025-12-04T09:17:30.0236248Z >>> 2025-12-04T09:17:30.0236405Z >>> def model(feature_vec): 2025-12-04T09:17:30.0236624Z >>> # Very simple linear model with activation 2025-12-04T09:17:30.0236879Z >>> return feature_vec.dot(weights).relu() 2025-12-04T09:17:30.0237091Z >>> 2025-12-04T09:17:30.0237279Z >>> examples = torch.randn(batch_size, feature_size) 2025-12-04T09:17:30.0237535Z >>> result = torch.vmap(model)(examples) 2025-12-04T09:17:30.0237747Z 2025-12-04T09:17:30.0237987Z :func:`vmap` can also help vectorize computations that were previously difficult 2025-12-04T09:17:30.0238382Z or impossible to batch. One example is higher-order gradient computation. 2025-12-04T09:17:30.0238772Z The PyTorch autograd engine computes vjps (vector-Jacobian products). 2025-12-04T09:17:30.0239270Z Computing a full Jacobian matrix for some function f: R^N -> R^N usually 2025-12-04T09:17:30.0239665Z requires N calls to ``autograd.grad``, one per Jacobian row. Using :func:`vmap`, 2025-12-04T09:17:30.0240068Z we can vectorize the whole computation, computing the Jacobian in a single 2025-12-04T09:17:30.0240373Z call to ``autograd.grad``. 2025-12-04T09:17:30.0240556Z 2025-12-04T09:17:30.0240681Z >>> # Setup 2025-12-04T09:17:30.0240841Z >>> N = 5 2025-12-04T09:17:30.0240999Z >>> f = lambda x: x**2 2025-12-04T09:17:30.0241210Z >>> x = torch.randn(N, requires_grad=True) 2025-12-04T09:17:30.0241427Z >>> y = f(x) 2025-12-04T09:17:30.0241589Z >>> I_N = torch.eye(N) 2025-12-04T09:17:30.0241753Z >>> 2025-12-04T09:17:30.0241904Z >>> # Sequential approach 2025-12-04T09:17:30.0242178Z >>> jacobian_rows = [torch.autograd.grad(y, x, v, retain_graph=True)[0] 2025-12-04T09:17:30.0242471Z >>> for v in I_N.unbind()] 2025-12-04T09:17:30.0242713Z >>> jacobian = torch.stack(jacobian_rows) 2025-12-04T09:17:30.0242922Z >>> 2025-12-04T09:17:30.0243084Z >>> # vectorized gradient computation 2025-12-04T09:17:30.0243295Z >>> def get_vjp(v): 2025-12-04T09:17:30.0243494Z >>> return torch.autograd.grad(y, x, v) 2025-12-04T09:17:30.0243736Z >>> jacobian = torch.vmap(get_vjp)(I_N) 2025-12-04T09:17:30.0243940Z 2025-12-04T09:17:30.0244193Z :func:`vmap` can also be nested, producing an output with multiple batched dimensions 2025-12-04T09:17:30.0244503Z 2025-12-04T09:17:30.0244639Z >>> torch.dot # [D], [D] -> [] 2025-12-04T09:17:30.0244852Z >>> batched_dot = torch.vmap( 2025-12-04T09:17:30.0245060Z ... torch.vmap(torch.dot) 2025-12-04T09:17:30.0245272Z ... ) # [N1, N0, D], [N1, N0, D] -> [N1, N0] 2025-12-04T09:17:30.0245522Z >>> x, y = torch.randn(2, 3, 5), torch.randn(2, 3, 5) 2025-12-04T09:17:30.0245780Z >>> batched_dot(x, y) # tensor of size [2, 3] 2025-12-04T09:17:30.0245997Z 2025-12-04T09:17:30.0246225Z If the inputs are not batched along the first dimension, ``in_dims`` specifies 2025-12-04T09:17:30.0246569Z the dimension that each inputs are batched along as 2025-12-04T09:17:30.0246799Z 2025-12-04T09:17:30.0246939Z >>> torch.dot # [N], [N] -> [] 2025-12-04T09:17:30.0247271Z >>> batched_dot = torch.vmap(torch.dot, in_dims=1) # [N, D], [N, D] -> [D] 2025-12-04T09:17:30.0247582Z >>> x, y = torch.randn(2, 5), torch.randn(2, 5) 2025-12-04T09:17:30.0247795Z >>> batched_dot( 2025-12-04T09:17:30.0247964Z ... x, y 2025-12-04T09:17:30.0248199Z ... ) # output is [5] instead of [2] if batched along the 0th dimension 2025-12-04T09:17:30.0248453Z 2025-12-04T09:17:30.0248705Z If there are multiple inputs each of which is batched along different dimensions, 2025-12-04T09:17:30.0249137Z ``in_dims`` must be a tuple with the batch dimension for each input as 2025-12-04T09:17:30.0249443Z 2025-12-04T09:17:30.0249590Z >>> torch.dot # [D], [D] -> [] 2025-12-04T09:17:30.0249881Z >>> batched_dot = torch.vmap(torch.dot, in_dims=(0, None)) # [N, D], [D] -> [N] 2025-12-04T09:17:30.0250206Z >>> x, y = torch.randn(2, 5), torch.randn(5) 2025-12-04T09:17:30.0250420Z >>> batched_dot( 2025-12-04T09:17:30.0250589Z ... x, y 2025-12-04T09:17:30.0250819Z ... ) # second arg doesn't have a batch dim because in_dim[1] was None 2025-12-04T09:17:30.0251077Z 2025-12-04T09:17:30.0251312Z If the input is a Python struct, ``in_dims`` must be a tuple containing a struct 2025-12-04T09:17:30.0251623Z matching the shape of the input: 2025-12-04T09:17:30.0251815Z 2025-12-04T09:17:30.0251977Z >>> f = lambda dict: torch.dot(dict["x"], dict["y"]) 2025-12-04T09:17:30.0252228Z >>> x, y = torch.randn(2, 5), torch.randn(5) 2025-12-04T09:17:30.0252507Z >>> input = {"x": x, "y": y} 2025-12-04T09:17:30.0252814Z >>> batched_dot = torch.vmap(f, in_dims=({"x": 0, "y": None},)) 2025-12-04T09:17:30.0253083Z >>> batched_dot(input) 2025-12-04T09:17:30.0253258Z 2025-12-04T09:17:30.0253510Z By default, the output is batched along the first dimension. However, it can be batched 2025-12-04T09:17:30.0253857Z along any dimension by using ``out_dims`` 2025-12-04T09:17:30.0254068Z 2025-12-04T09:17:30.0254200Z >>> f = lambda x: x**2 2025-12-04T09:17:30.0254385Z >>> x = torch.randn(2, 5) 2025-12-04T09:17:30.0254602Z >>> batched_pow = torch.vmap(f, out_dims=1) 2025-12-04T09:17:30.0254832Z >>> batched_pow(x) # [5, 2] 2025-12-04T09:17:30.0255020Z 2025-12-04T09:17:30.0255289Z For any function that uses kwargs, the returned function will not batch the kwargs but will 2025-12-04T09:17:30.0255625Z accept kwargs 2025-12-04T09:17:30.0255769Z 2025-12-04T09:17:30.0255913Z >>> x = torch.randn([2, 5]) 2025-12-04T09:17:30.0256120Z >>> def fn(x, scale=4.): 2025-12-04T09:17:30.0256307Z >>> return x * scale 2025-12-04T09:17:30.0256481Z >>> 2025-12-04T09:17:30.0256638Z >>> batched_pow = torch.vmap(fn) 2025-12-04T09:17:30.0256873Z >>> assert torch.allclose(batched_pow(x), x * 4) 2025-12-04T09:17:30.0257199Z >>> batched_pow(x, scale=x) # scale is not batched, output has shape [2, 2, 5] 2025-12-04T09:17:30.0257484Z 2025-12-04T09:17:30.0257610Z .. note:: 2025-12-04T09:17:30.0257848Z vmap does not provide general autobatching or handle variable-length 2025-12-04T09:17:30.0258143Z sequences out of the box. 2025-12-04T09:17:30.0258329Z 2025-12-04T09:17:30.0258752Z Original Error: IndentationError('expected an indented block after function definition on line 4', ('', 5, 1, '_._ = None\n', 5, 2)) 2025-12-04T09:17:30.0259223Z 2025-12-04T09:17:30.0259350Z _._ = None 2025-12-04T09:17:30.0259489Z ^ 2025-12-04T09:17:30.0259636Z warnings.warn(msg) 2025-12-04T09:17:30.0259805Z 2025-12-04T09:17:30.0260006Z --- Parse Warning: 8 / 17 --- 2025-12-04T09:17:30.0260736Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/xdoctest/core.py:416: UserWarning: Cannot scrape callname=grad in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/apis.py line=306. 2025-12-04T09:17:30.0261569Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T09:17:30.0262175Z ``grad`` operator helps computing gradients of ``func`` with respect to the 2025-12-04T09:17:30.0262538Z input(s) specified by ``argnums``. This operator can be nested to 2025-12-04T09:17:30.0262824Z compute higher-order gradients. 2025-12-04T09:17:30.0263026Z 2025-12-04T09:17:30.0263152Z Args: 2025-12-04T09:17:30.0263380Z func (Callable): A Python function that takes one or more arguments. 2025-12-04T09:17:30.0263871Z Must return a single-element Tensor. If specified ``has_aux`` equals ``True``, 2025-12-04T09:17:30.0264384Z function can return a tuple of single-element Tensor and other auxiliary objects: 2025-12-04T09:17:30.0264704Z ``(output, aux)``. 2025-12-04T09:17:30.0265018Z argnums (int or Tuple[int]): Specifies arguments to compute gradients with respect to. 2025-12-04T09:17:30.0265417Z ``argnums`` can be single integer or tuple of integers. Default: 0. 2025-12-04T09:17:30.0265772Z has_aux (bool): Flag indicating that ``func`` returns a tensor and other 2025-12-04T09:17:30.0266121Z auxiliary objects: ``(output, aux)``. Default: False. 2025-12-04T09:17:30.0266363Z 2025-12-04T09:17:30.0266492Z Returns: 2025-12-04T09:17:30.0266775Z Function to compute gradients with respect to its inputs. By default, the output of 2025-12-04T09:17:30.0267201Z the function is the gradient tensor(s) with respect to the first argument. 2025-12-04T09:17:30.0267692Z If specified ``has_aux`` equals ``True``, tuple of gradients and output auxiliary objects 2025-12-04T09:17:30.0268121Z is returned. If ``argnums`` is a tuple of integers, a tuple of output gradients with 2025-12-04T09:17:30.0268467Z respect to each ``argnums`` value is returned. 2025-12-04T09:17:30.0268690Z 2025-12-04T09:17:30.0268845Z Example of using ``grad``: 2025-12-04T09:17:30.0269034Z 2025-12-04T09:17:30.0269175Z >>> # xdoctest: +SKIP 2025-12-04T09:17:30.0269381Z >>> from torch.func import grad 2025-12-04T09:17:30.0269593Z >>> x = torch.randn([]) 2025-12-04T09:17:30.0269820Z >>> cos_x = grad(lambda x: torch.sin(x))(x) 2025-12-04T09:17:30.0270073Z >>> assert torch.allclose(cos_x, x.cos()) 2025-12-04T09:17:30.0270283Z >>> 2025-12-04T09:17:30.0270450Z >>> # Second-order gradients 2025-12-04T09:17:30.0270699Z >>> neg_sin_x = grad(grad(lambda x: torch.sin(x)))(x) 2025-12-04T09:17:30.0270971Z >>> assert torch.allclose(neg_sin_x, -x.sin()) 2025-12-04T09:17:30.0271197Z 2025-12-04T09:17:30.0271447Z When composed with ``vmap``, ``grad`` can be used to compute per-sample-gradients: 2025-12-04T09:17:30.0271745Z 2025-12-04T09:17:30.0271890Z >>> # xdoctest: +SKIP 2025-12-04T09:17:30.0272105Z >>> from torch.func import grad, vmap 2025-12-04T09:17:30.0272348Z >>> batch_size, feature_size = 3, 5 2025-12-04T09:17:30.0272551Z >>> 2025-12-04T09:17:30.0272722Z >>> def model(weights, feature_vec): 2025-12-04T09:17:30.0272969Z >>> # Very simple linear model with activation 2025-12-04T09:17:30.0273210Z >>> assert feature_vec.dim() == 1 2025-12-04T09:17:30.0273450Z >>> return feature_vec.dot(weights).relu() 2025-12-04T09:17:30.0273663Z >>> 2025-12-04T09:17:30.0273844Z >>> def compute_loss(weights, example, target): 2025-12-04T09:17:30.0274099Z >>> y = model(weights, example) 2025-12-04T09:17:30.0274358Z >>> return ((y - target) ** 2).mean() # MSELoss 2025-12-04T09:17:30.0274594Z >>> 2025-12-04T09:17:30.0274797Z >>> weights = torch.randn(feature_size, requires_grad=True) 2025-12-04T09:17:30.0275095Z >>> examples = torch.randn(batch_size, feature_size) 2025-12-04T09:17:30.0275423Z >>> targets = torch.randn(batch_size) 2025-12-04T09:17:30.0275660Z >>> inputs = (weights, examples, targets) 2025-12-04T09:17:30.0275979Z >>> grad_weight_per_example = vmap(grad(compute_loss), in_dims=(None, 0, 0))( 2025-12-04T09:17:30.0276281Z ... *inputs 2025-12-04T09:17:30.0276445Z ... ) 2025-12-04T09:17:30.0276591Z 2025-12-04T09:17:30.0276802Z Example of using ``grad`` with ``has_aux`` and ``argnums``: 2025-12-04T09:17:30.0277043Z 2025-12-04T09:17:30.0277229Z >>> # xdoctest: +SKIP 2025-12-04T09:17:30.0277469Z >>> from torch.func import grad 2025-12-04T09:17:30.0277698Z >>> def my_loss_func(y, y_pred): 2025-12-04T09:17:30.0277929Z >>> loss_per_sample = (0.5 * y_pred - y) ** 2 2025-12-04T09:17:30.0278173Z >>> loss = loss_per_sample.mean() 2025-12-04T09:17:30.0278416Z >>> return loss, (y_pred, loss_per_sample) 2025-12-04T09:17:30.0278625Z >>> 2025-12-04T09:17:30.0278814Z >>> fn = grad(my_loss_func, argnums=(0, 1), has_aux=True) 2025-12-04T09:17:30.0279064Z >>> y_true = torch.rand(4) 2025-12-04T09:17:30.0279303Z >>> y_preds = torch.rand(4, requires_grad=True) 2025-12-04T09:17:30.0279545Z >>> out = fn(y_true, y_preds) 2025-12-04T09:17:30.0279855Z >>> # > output is ((grads w.r.t y_true, grads w.r.t y_preds), (y_pred, loss_per_sample)) 2025-12-04T09:17:30.0280152Z 2025-12-04T09:17:30.0280281Z .. note:: 2025-12-04T09:17:30.0280537Z Using PyTorch ``torch.no_grad`` together with ``grad``. 2025-12-04T09:17:30.0280785Z 2025-12-04T09:17:30.0280962Z Case 1: Using ``torch.no_grad`` inside a function: 2025-12-04T09:17:30.0281192Z 2025-12-04T09:17:30.0281337Z >>> # xdoctest: +SKIP 2025-12-04T09:17:30.0281532Z >>> def f(x): 2025-12-04T09:17:30.0281728Z >>> with torch.no_grad(): 2025-12-04T09:17:30.0281958Z >>> c = x ** 2 2025-12-04T09:17:30.0282154Z >>> return x - c 2025-12-04T09:17:30.0282344Z 2025-12-04T09:17:30.0282557Z In this case, ``grad(f)(x)`` will respect the inner ``torch.no_grad``. 2025-12-04T09:17:30.0282817Z 2025-12-04T09:17:30.0283017Z Case 2: Using ``grad`` inside ``torch.no_grad`` context manager: 2025-12-04T09:17:30.0283276Z 2025-12-04T09:17:30.0283419Z >>> # xdoctest: +SKIP 2025-12-04T09:17:30.0283619Z >>> with torch.no_grad(): 2025-12-04T09:17:30.0283835Z >>> grad(f)(x) 2025-12-04T09:17:30.0284019Z 2025-12-04T09:17:30.0284235Z In this case, ``grad`` will respect the inner ``torch.no_grad``, but not the 2025-12-04T09:17:30.0284613Z outer one. This is because ``grad`` is a "function transform": its result 2025-12-04T09:17:30.0285009Z should not depend on the result of a context manager outside of ``f``. 2025-12-04T09:17:30.0285292Z 2025-12-04T09:17:30.0285425Z 2025-12-04T09:17:30.0285860Z Original Error: IndentationError('expected an indented block after function definition on line 5', ('', 6, 1, '_._ = None\n', 6, 2)) 2025-12-04T09:17:30.0286338Z 2025-12-04T09:17:30.0286461Z _._ = None 2025-12-04T09:17:30.0286603Z ^ 2025-12-04T09:17:30.0286751Z warnings.warn(msg) 2025-12-04T09:17:30.0286914Z 2025-12-04T09:17:30.0287139Z --- Parse Warning: 9 / 17 --- 2025-12-04T09:17:30.0287940Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/xdoctest/core.py:416: UserWarning: Cannot scrape callname=CustomOpDef.register_fake in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_library/custom_ops.py line=402. 2025-12-04T09:17:30.0288815Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T09:17:30.0289185Z Register a FakeTensor implementation for this custom op. 2025-12-04T09:17:30.0289480Z 2025-12-04T09:17:30.0289722Z This is necessary to get the operator to work efficiently with torch.compile. 2025-12-04T09:17:30.0290015Z 2025-12-04T09:17:30.0290239Z The Fake impl (sometimes also known as a meta kernel or abstract impl) 2025-12-04T09:17:30.0290612Z specifies the behavior of this operator on Tensors that carry no data. 2025-12-04T09:17:30.0290945Z Given some input Tensors with certain properties 2025-12-04T09:17:30.0291333Z (sizes/strides/storage_offset/device), it specifies what the properties of 2025-12-04T09:17:30.0291690Z the output Tensors are. 2025-12-04T09:17:30.0291880Z 2025-12-04T09:17:30.0292085Z Please see :func:`torch.library.register_fake` for more details. 2025-12-04T09:17:30.0292421Z 2025-12-04T09:17:30.0292553Z Args: 2025-12-04T09:17:30.0292763Z fn (Callable): The function to register as the FakeTensor 2025-12-04T09:17:30.0293033Z implementation. 2025-12-04T09:17:30.0293218Z 2025-12-04T09:17:30.0293345Z Examples: 2025-12-04T09:17:30.0293514Z >>> import torch 2025-12-04T09:17:30.0293711Z >>> import numpy as np 2025-12-04T09:17:30.0293923Z >>> from torch import Tensor 2025-12-04T09:17:30.0294125Z >>> 2025-12-04T09:17:30.0294353Z >>> # Example 1: an operator without data-dependent output shape 2025-12-04T09:17:30.0294687Z >>> @torch.library.custom_op("mylib::linear", mutates_args=()) 2025-12-04T09:17:30.0295064Z >>> def linear(x: Tensor, weight: Tensor, bias: Tensor) -> Tensor: 2025-12-04T09:17:30.0295355Z >>> return (x @ weight.t()) + bias 2025-12-04T09:17:30.0295567Z >>> 2025-12-04T09:17:30.0295728Z >>> @linear.register_fake 2025-12-04T09:17:30.0295949Z >>> def _(x, weight, bias): 2025-12-04T09:17:30.0296168Z >>> assert x.dim() == 2 2025-12-04T09:17:30.0296381Z >>> assert weight.dim() == 2 2025-12-04T09:17:30.0296606Z >>> assert bias.dim() == 1 2025-12-04T09:17:30.0296839Z >>> assert x.shape[1] == weight.shape[1] 2025-12-04T09:17:30.0297103Z >>> assert weight.shape[0] == bias.shape[0] 2025-12-04T09:17:30.0297364Z >>> assert x.device == weight.device 2025-12-04T09:17:30.0297630Z >>> return x.new_empty(x.size(0), weight.size(0)) 2025-12-04T09:17:30.0297865Z >>> 2025-12-04T09:17:30.0298030Z >>> x = torch.randn(2, 2) 2025-12-04T09:17:30.0298252Z >>> weight = torch.randn(2, 2) 2025-12-04T09:17:30.0298473Z >>> bias = torch.randn(2) 2025-12-04T09:17:30.0298701Z >>> # xdoctest: +SKIP("Requires Python <= 3.11") 2025-12-04T09:17:30.0299003Z >>> out = torch.compile(linear, fullgraph=True)(x, weight, bias) 2025-12-04T09:17:30.0299301Z >>> # xdoctest: +SKIP("Requires Python <= 3.11") 2025-12-04T09:17:30.0299640Z >>> assert torch.allclose(out, torch.nn.functional.linear(x, weight, bias)) 2025-12-04T09:17:30.0299939Z >>> 2025-12-04T09:17:30.0300165Z >>> # Example 2: an operator with data-dependent output shape 2025-12-04T09:17:30.0300501Z >>> @torch.library.custom_op("mylib::nonzero", mutates_args=()) 2025-12-04T09:17:30.0300787Z >>> def nonzero(x: Tensor) -> Tensor: 2025-12-04T09:17:30.0301017Z >>> x_np = x.cpu().numpy() 2025-12-04T09:17:30.0301253Z >>> res = np.stack(np.nonzero(x_np), axis=1) 2025-12-04T09:17:30.0301514Z >>> return torch.tensor(res, device=x.device) 2025-12-04T09:17:30.0301734Z >>> 2025-12-04T09:17:30.0301903Z >>> @nonzero.register_fake 2025-12-04T09:17:30.0302111Z >>> def _(x): 2025-12-04T09:17:30.0302388Z >>> # Number of nonzero-elements is data-dependent. 2025-12-04T09:17:30.0302680Z >>> # Since we cannot peek at the data in an abstract impl, 2025-12-04T09:17:30.0302976Z >>> # we use the ctx object to construct a new symint that 2025-12-04T09:17:30.0303246Z >>> # represents the data-dependent size. 2025-12-04T09:17:30.0303499Z >>> ctx = torch.library.get_ctx() 2025-12-04T09:17:30.0303737Z >>> nnz = ctx.new_dynamic_size() 2025-12-04T09:17:30.0304008Z >>> shape = [nnz, x.dim()] 2025-12-04T09:17:30.0304299Z >>> result = x.new_empty(shape, dtype=torch.int64) 2025-12-04T09:17:30.0304543Z >>> return result 2025-12-04T09:17:30.0304734Z >>> 2025-12-04T09:17:30.0304902Z >>> x = torch.tensor([0, 1, 2, 0, 0, 1]) 2025-12-04T09:17:30.0305150Z >>> # xdoctest: +SKIP("Requires Python <= 3.11") 2025-12-04T09:17:30.0305419Z >>> out = torch.compile(nonzero, fullgraph=True)(x) 2025-12-04T09:17:30.0305676Z >>> # xdoctest: +SKIP("Requires Python <= 3.11") 2025-12-04T09:17:30.0305937Z >>> assert torch.allclose(out, x.nonzero()) 2025-12-04T09:17:30.0306159Z 2025-12-04T09:17:30.0306283Z 2025-12-04T09:17:30.0306729Z Original Error: IndentationError('expected an indented block after function definition on line 36', ('', 37, 1, '_._ = None\n', 37, 2)) 2025-12-04T09:17:30.0307213Z 2025-12-04T09:17:30.0307348Z _._ = None 2025-12-04T09:17:30.0307489Z ^ 2025-12-04T09:17:30.0307679Z warnings.warn(msg) 2025-12-04T09:17:30.0307847Z 2025-12-04T09:17:30.0308049Z --- Parse Warning: 10 / 17 --- 2025-12-04T09:17:30.0308848Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/xdoctest/core.py:416: UserWarning: Cannot scrape callname=unsafe_generate_fake_kernels in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_library/fake_profile.py line=94. 2025-12-04T09:17:30.0309739Z Caused by: DoctestParseError('Failed to parse doctest in _label_docsrc_lines') 2025-12-04T09:17:30.0310050Z 2025-12-04T09:17:30.0310273Z Registers a fake kernel based on the given operator profiles. This fake 2025-12-04T09:17:30.0310691Z kernel registration will override any existing fake kernel registrations. 2025-12-04T09:17:30.0311000Z 2025-12-04T09:17:30.0311222Z The input is a dictionary mapping operator names to a set of operator 2025-12-04T09:17:30.0311612Z profiles, which we will use to generate fake kernels. The operator profiles 2025-12-04T09:17:30.0311994Z are a record of the input and output tensor metadata. Based on this 2025-12-04T09:17:30.0312368Z information we will match a given input to the recorded profile, and return 2025-12-04T09:17:30.0312754Z an output with the same metadata as in the recorded profile. If a profile 2025-12-04T09:17:30.0313084Z doesn't exist then an exception will be thrown. 2025-12-04T09:17:30.0313313Z 2025-12-04T09:17:30.0313540Z The fake kernel generation is considered unsafe because it relies on the 2025-12-04T09:17:30.0313931Z rigid, pre-defined operator profiles that do not account for potential 2025-12-04T09:17:30.0314328Z variations in output behavior. Specifically, the generated kernels assume a 2025-12-04T09:17:30.0314733Z fixed relationship between input and output ranks. However, in reality, it's 2025-12-04T09:17:30.0315144Z possible that data-dependent operations may produce outputs of different 2025-12-04T09:17:30.0315532Z ranks even when given inputs of the same rank. The generated fake kernels 2025-12-04T09:17:30.0315905Z are inflexible and unable to accommodate these nuances, making them 2025-12-04T09:17:30.0316187Z potentially unsafe. 2025-12-04T09:17:30.0316350Z 2025-12-04T09:17:30.0316482Z Args: 2025-12-04T09:17:30.0316704Z op_profiles (dict[str, set[OpProfile]]): A dictionary mapping operator 2025-12-04T09:17:30.0317109Z name to a set of operator profiles from which we will generate fake 2025-12-04T09:17:30.0317385Z kernels. 2025-12-04T09:17:30.0317534Z 2025-12-04T09:17:30.0317668Z Examples: 2025-12-04T09:17:30.0317815Z 2025-12-04T09:17:30.0318001Z >>> # Example: Registering an op-profile from draft-export 2025-12-04T09:17:30.0318275Z >>> import torch 2025-12-04T09:17:30.0318506Z >>> from torch.export._draft_export import draft_export 2025-12-04T09:17:30.0318749Z >>> 2025-12-04T09:17:30.0319049Z >>> @torch.library.custom_op("mylib::foo", mutates_args=()) 2025-12-04T09:17:30.0319334Z >>> def foo(x: Tensor, y: Tensor) -> Tensor: 2025-12-04T09:17:30.0319561Z >>> return x + y 2025-12-04T09:17:30.0319734Z >>> 2025-12-04T09:17:30.0319895Z >>> class M(torch.nn.Module): 2025-12-04T09:17:30.0320110Z >>> def forward(self, a, b): 2025-12-04T09:17:30.0320352Z >>> res = torch.ops.mylib.foo(a, b) # no fake impl 2025-12-04T09:17:30.0320604Z >>> return res 2025-12-04T09:17:30.0320786Z >>> 2025-12-04T09:17:30.0320976Z >>> ep = draft_export(M(), (torch.ones(3, 4), torch.ones(3, 4)) 2025-12-04T09:17:30.0321224Z >>> 2025-12-04T09:17:30.0321508Z >>> with torch._library.fake_profile.unsafe_generate_fake_kernels(ep._report.op_profiles): 2025-12-04T09:17:30.0321877Z >>> decomp = ep.run_decompositions() 2025-12-04T09:17:30.0322081Z 2025-12-04T09:17:30.0322209Z 2025-12-04T09:17:30.0322656Z Original Error: IncompleteParseError('ill-formed doctest: all parts have been processed but the doctest source is not balanced') 2025-12-04T09:17:30.0330442Z 2025-12-04T09:17:30.0330668Z warnings.warn(msg) 2025-12-04T09:17:30.0330866Z 2025-12-04T09:17:30.0331123Z --- Parse Warning: 11 / 17 --- 2025-12-04T09:17:30.0332103Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/xdoctest/core.py:416: UserWarning: Cannot scrape callname=ActivationSparsifier in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/pruning/_experimental/activation_sparsifier/activation_sparsifier.py line=16. 2025-12-04T09:17:30.0333250Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T09:17:30.0333566Z 2025-12-04T09:17:30.0333826Z The Activation sparsifier class aims to sparsify/prune activations in a neural 2025-12-04T09:17:30.0334243Z network. The idea is to attach the sparsifier to a layer (or layers) and it 2025-12-04T09:17:30.0334656Z zeroes out the activations based on the mask_fn (or sparsification function) 2025-12-04T09:17:30.0334963Z input by the user. 2025-12-04T09:17:30.0335248Z The mask_fn is applied once all the inputs are aggregated and reduced i.e. 2025-12-04T09:17:30.0335592Z mask = mask_fn(reduce_fn(aggregate_fn(activations))) 2025-12-04T09:17:30.0335828Z 2025-12-04T09:17:30.0335963Z Note:: 2025-12-04T09:17:30.0336264Z The sparsification mask is computed on the input **before it goes through the attached layer**. 2025-12-04T09:17:30.0336612Z 2025-12-04T09:17:30.0336737Z Args: 2025-12-04T09:17:30.0336891Z model (nn.Module): 2025-12-04T09:17:30.0337159Z The model whose layers will be sparsified. The layers that needs to be 2025-12-04T09:17:30.0337546Z sparsified should be added separately using the register_layer() function 2025-12-04T09:17:30.0337871Z aggregate_fn (Optional, Callable): 2025-12-04T09:17:30.0338202Z default aggregate_fn that is used if not specified while registering the layer. 2025-12-04T09:17:30.0338573Z specifies how inputs should be aggregated over time. 2025-12-04T09:17:30.0338950Z The aggregate_fn should usually take 2 torch tensors and return the aggregated tensor. 2025-12-04T09:17:30.0339291Z Example 2025-12-04T09:17:30.0339611Z def add_agg_fn(tensor1, tensor2): return tensor1 + tensor2 2025-12-04T09:17:30.0339893Z reduce_fn (Optional, Callable): 2025-12-04T09:17:30.0340217Z default reduce_fn that is used if not specified while registering the layer. 2025-12-04T09:17:30.0340634Z reduce_fn will be called on the aggregated tensor i.e. the tensor obtained after 2025-12-04T09:17:30.0340979Z calling agg_fn() on all inputs. 2025-12-04T09:17:30.0341197Z Example 2025-12-04T09:17:30.0341502Z def mean_reduce_fn(agg_tensor): return agg_tensor.mean(dim=0) 2025-12-04T09:17:30.0341840Z mask_fn (Optional, Callable): 2025-12-04T09:17:30.0342199Z default mask_fn that is used to create the sparsification mask using the tensor obtained after 2025-12-04T09:17:30.0342678Z calling the reduce_fn(). This is used by default if a custom one is passed in the 2025-12-04T09:17:30.0343006Z register_layer(). 2025-12-04T09:17:30.0343383Z Note that the mask_fn() definition should contain the sparse arguments that is passed in sparse_config 2025-12-04T09:17:30.0343754Z arguments. 2025-12-04T09:17:30.0343955Z features (Optional, list): 2025-12-04T09:17:30.0344196Z default selected features to sparsify. 2025-12-04T09:17:30.0344543Z If this is non-empty, then the mask_fn will be applied for each feature of the input. 2025-12-04T09:17:30.0344874Z For example, 2025-12-04T09:17:30.0345217Z mask = [mask_fn(reduce_fn(aggregated_fn(input[feature])) for feature in features] 2025-12-04T09:17:30.0345548Z feature_dim (Optional, int): 2025-12-04T09:17:30.0345878Z default dimension of input features. Again, features along this dim will be chosen 2025-12-04T09:17:30.0346217Z for sparsification. 2025-12-04T09:17:30.0346436Z sparse_config (Dict): 2025-12-04T09:17:30.0346722Z Default configuration for the mask_fn. This config will be passed 2025-12-04T09:17:30.0347022Z with the mask_fn() 2025-12-04T09:17:30.0347229Z 2025-12-04T09:17:30.0347365Z Example: 2025-12-04T09:17:30.0347527Z >>> # xdoctest: +SKIP 2025-12-04T09:17:30.0347728Z >>> model = SomeModel() 2025-12-04T09:17:30.0348029Z >>> act_sparsifier = ActivationSparsifier(...) # init activation sparsifier 2025-12-04T09:17:30.0348339Z >>> # Initialize aggregate_fn 2025-12-04T09:17:30.0348547Z >>> def agg_fn(x, y): 2025-12-04T09:17:30.0348738Z >>> return x + y 2025-12-04T09:17:30.0348909Z >>> 2025-12-04T09:17:30.0349067Z >>> # Initialize reduce_fn 2025-12-04T09:17:30.0349261Z >>> def reduce_fn(x): 2025-12-04T09:17:30.0349447Z >>> return torch.mean(x, dim=0) 2025-12-04T09:17:30.0349651Z >>> 2025-12-04T09:17:30.0349801Z >>> # Initialize mask_fn 2025-12-04T09:17:30.0349987Z >>> def mask_fn(data): 2025-12-04T09:17:30.0350204Z >>> return torch.eye(data.shape).to(data.device) 2025-12-04T09:17:30.0350440Z >>> 2025-12-04T09:17:30.0350567Z >>> 2025-12-04T09:17:30.0350730Z >>> act_sparsifier.register_layer( 2025-12-04T09:17:30.0350959Z ... model.some_layer, 2025-12-04T09:17:30.0351153Z ... aggregate_fn=agg_fn, 2025-12-04T09:17:30.0351357Z ... reduce_fn=reduce_fn, 2025-12-04T09:17:30.0351551Z ... mask_fn=mask_fn, 2025-12-04T09:17:30.0351734Z ... ) 2025-12-04T09:17:30.0351872Z >>> 2025-12-04T09:17:30.0352023Z >>> # start training process 2025-12-04T09:17:30.0354561Z >>> for _ in [...]: 2025-12-04T09:17:30.0354816Z >>> # epoch starts 2025-12-04T09:17:30.0355067Z >>> # model.forward(), compute_loss() and model.backwards() 2025-12-04T09:17:30.0355324Z >>> # epoch ends 2025-12-04T09:17:30.0355577Z >>> act_sparsifier.step() 2025-12-04T09:17:30.0355799Z >>> # end training process 2025-12-04T09:17:30.0356015Z >>> sparsifier.squash_mask() 2025-12-04T09:17:30.0356221Z 2025-12-04T09:17:30.0356649Z Original Error: IndentationError("expected an indented block after 'for' statement on line 25", ('', 26, 1, '_._ = None\n', 26, 2)) 2025-12-04T09:17:30.0357115Z 2025-12-04T09:17:30.0357245Z _._ = None 2025-12-04T09:17:30.0357394Z ^ 2025-12-04T09:17:30.0357543Z warnings.warn(msg) 2025-12-04T09:17:30.0357710Z 2025-12-04T09:17:30.0358045Z --- Parse Warning: 12 / 17 --- 2025-12-04T09:17:30.0358891Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/xdoctest/core.py:416: UserWarning: Cannot scrape callname=DeviceMesh.__getitem__ in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/device_mesh.py line=547. 2025-12-04T09:17:30.0359767Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T09:17:30.0360076Z 2025-12-04T09:17:30.0362515Z Slice the current DeviceMesh based on the mesh_dim_names given to create a submesh. 2025-12-04T09:17:30.0363047Z The submesh created consists of the dimensions and the communicators indicated by 2025-12-04T09:17:30.0363390Z ``mesh_dim_names`` 2025-12-04T09:17:30.0363561Z 2025-12-04T09:17:30.0363694Z Args: 2025-12-04T09:17:30.0363944Z mesh_dim_names (Union[str, Tuple[str]]): the name or the tuple of names of the 2025-12-04T09:17:30.0364317Z mesh dimension of the DeviceMesh to create the submesh for. 2025-12-04T09:17:30.0364575Z Returns: 2025-12-04T09:17:30.0364743Z A :class:`DeviceMesh` object 2025-12-04T09:17:30.0364944Z 2025-12-04T09:17:30.0365231Z The following program runs on each process/rank in an SPMD manner in a world size of 8. 2025-12-04T09:17:30.0365578Z In the first example: 2025-12-04T09:17:30.0365867Z Calling mesh_2d["tp"] on rank 0, 1, 2, 3 returns a 1D submesh of DeviceMesh:([0, 1, 2, 3]). 2025-12-04T09:17:30.0366295Z Calling mesh_2d["tp"] on rank 4, 5, 6, 7 returns a 1D submesh of DeviceMesh:([4, 5, 6, 7]). 2025-12-04T09:17:30.0366680Z Calling mesh_2d["dp"] on rank 0, 4 returns a 1D submesh of DeviceMesh:([0, 4]). 2025-12-04T09:17:30.0367050Z Calling mesh_2d["dp"] on rank 1, 5 returns a 1D submesh of DeviceMesh:([1, 5]). 2025-12-04T09:17:30.0367415Z Calling mesh_2d["dp"] on rank 2, 6 returns a 1D submesh of DeviceMesh:([2, 6]). 2025-12-04T09:17:30.0367770Z Calling mesh_2d["dp"] on rank 3, 7 returns a 1D submesh of DeviceMesh:([3, 7]). 2025-12-04T09:17:30.0368044Z 2025-12-04T09:17:30.0368189Z In the second example: 2025-12-04T09:17:30.0368482Z Calling mesh_3d["dp", "cp"] on rank 0, 1, 4, 5 returns a 2D submesh of DeviceMesh:([[0, 1], [4, 5]]). 2025-12-04T09:17:30.0368906Z Calling mesh_3d["dp", "cp"] on rank 2, 3, 6, 7 returns a 2D submesh of DeviceMesh:([[2, 3], [6, 7]]). 2025-12-04T09:17:30.0369341Z Calling mesh_3d["cp", "dp"] on rank 0, 1, 4, 5 returns a 2D submesh of DeviceMesh:([[0, 4], [1, 5]]). 2025-12-04T09:17:30.0369780Z Calling mesh_3d["cp", "dp"] on rank 2, 3, 6, 7 returns a 2D submesh of DeviceMesh:([[2, 6], [3, 7]]). 2025-12-04T09:17:30.0370079Z 2025-12-04T09:17:30.0370219Z Example:: 2025-12-04T09:17:30.0370365Z 2025-12-04T09:17:30.0370507Z >>> # xdoctest: +SKIP("no rank") 2025-12-04T09:17:30.0370775Z >>> from torch.distributed.device_mesh import DeviceMesh 2025-12-04T09:17:30.0371034Z >>> 2025-12-04T09:17:30.0371258Z >>> # Initialize a 2D device mesh as (2, 4) to represent the topology 2025-12-04T09:17:30.0371560Z >>> # of cross-host(dim 0), and within-host (dim 1). 2025-12-04T09:17:30.0372045Z >>> mesh_2d = init_device_mesh(device_type="cuda", (2,4), mesh_dim_names=("dp", "tp")) 2025-12-04T09:17:30.0372441Z >>> tp_mesh = mesh_2d["tp"] 2025-12-04T09:17:30.0372705Z >>> dp_mesh = mesh_2d["dp"] 2025-12-04T09:17:30.0372891Z >>> 2025-12-04T09:17:30.0373042Z >>> # Initialize a 3D mesh. 2025-12-04T09:17:30.0373351Z >>> mesh_3d = init_device_mesh(device_type="cuda", (2,2,2), mesh_dim_names=("dp", "pp", "cp")) 2025-12-04T09:17:30.0373819Z >>> # The order of the mesh_dim_names provided deteremines the order of dimensions in the submesh. 2025-12-04T09:17:30.0374179Z >>> dp_cp_mesh = mesh_3d["dp", "cp"] 2025-12-04T09:17:30.0374405Z >>> cp_dp_mesh = mesh_3d["cp", "dp"] 2025-12-04T09:17:30.0374598Z 2025-12-04T09:17:30.0375147Z Original Error: SyntaxError('positional argument follows keyword argument', ('', 6, 82, 'mesh_2d = init_device_mesh(device_type="cuda", (2,4), mesh_dim_names=("dp", "tp"))\n', 6, 83)) 2025-12-04T09:17:30.0375793Z 2025-12-04T09:17:30.0376026Z mesh_2d = init_device_mesh(device_type="cuda", (2,4), mesh_dim_names=("dp", "tp")) 2025-12-04T09:17:30.0376353Z ^ 2025-12-04T09:17:30.0376586Z warnings.warn(msg) 2025-12-04T09:17:30.0376757Z 2025-12-04T09:17:30.0377074Z --- Parse Warning: 13 / 17 --- 2025-12-04T09:17:30.0377880Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/xdoctest/core.py:416: UserWarning: Cannot scrape callname=SavePlanner in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/planner.py line=122. 2025-12-04T09:17:30.0378751Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T09:17:30.0379050Z 2025-12-04T09:17:30.0379321Z Abstract class defining the protocol used by save_state_dict to plan the save process. 2025-12-04T09:17:30.0379649Z 2025-12-04T09:17:30.0379921Z SavePlanners are stateful objects that can be used to customize the whole save process. 2025-12-04T09:17:30.0380244Z 2025-12-04T09:17:30.0380501Z SavePlanner acts as an access proxy to the state_dict, so any transformation done to it 2025-12-04T09:17:30.0380861Z will be visible to the whole process. 2025-12-04T09:17:30.0381073Z 2025-12-04T09:17:30.0381336Z A planner subclass can expect the following sequence of calls during save_state_dict: 2025-12-04T09:17:30.0381653Z 2025-12-04T09:17:30.0381804Z 1) set_up_planner - called on all ranks. 2025-12-04T09:17:30.0382044Z Signals the start of a checkpoint save. 2025-12-04T09:17:30.0382256Z 2025-12-04T09:17:30.0382406Z 2) create_local_plan - called on all ranks. 2025-12-04T09:17:30.0382754Z Process the state_dict and produces a `SavePlan` that will be sent for global planning. 2025-12-04T09:17:30.0383076Z 2025-12-04T09:17:30.0383275Z 3) create_global_plan - called on the coordinator rank only. 2025-12-04T09:17:30.0383601Z Takes the SavePlan from all ranks and make any global decision. 2025-12-04T09:17:30.0383862Z 2025-12-04T09:17:30.0384008Z 4) finish_plan - called on all ranks. 2025-12-04T09:17:30.0384293Z This gives each rank a chance to adjust to global planning decisions. 2025-12-04T09:17:30.0384568Z 2025-12-04T09:17:30.0384747Z 5) resolve_data - called multiple times on each rank 2025-12-04T09:17:30.0385055Z Lookups a value on the `state_dict` for the storage layer to write. 2025-12-04T09:17:30.0385319Z 2025-12-04T09:17:30.0385595Z Users are recommended to extend DefaultSavePlanner instead of this interface directly as 2025-12-04T09:17:30.0386017Z most changes can be expressed by changes in a single method. 2025-12-04T09:17:30.0386264Z 2025-12-04T09:17:30.0386427Z There are 3 usual patterns of extension: 2025-12-04T09:17:30.0386634Z 2025-12-04T09:17:30.0386939Z Rewriting state_dict. This is the simplest way to extend the save process as it 2025-12-04T09:17:30.0387348Z doesn't requite understanding the intrincacies of how SavePlan works: 2025-12-04T09:17:30.0387634Z 2025-12-04T09:17:30.0387782Z >>> # xdoctest: +SKIP("undefined vars") 2025-12-04T09:17:30.0388063Z >>> class RenamePlanner(DefaultSavePlanner): 2025-12-04T09:17:30.0388299Z >>> def set_up_planner( 2025-12-04T09:17:30.0388481Z >>> self, 2025-12-04T09:17:30.0388665Z >>> state_dict: STATE_DICT_TYPE, 2025-12-04T09:17:30.0388900Z >>> storage_meta: Optional[StorageMeta], 2025-12-04T09:17:30.0389132Z >>> is_coordinator: bool, 2025-12-04T09:17:30.0389325Z >>> ) -> None: 2025-12-04T09:17:30.0389504Z >>> # prefix all keys with `foo_`` 2025-12-04T09:17:30.0389852Z >>> super().set_up_planner({"foo_" + k: v for k, v in state_dict.items()}, storage_meta, is_coordinator) 2025-12-04T09:17:30.0390210Z 2025-12-04T09:17:30.0390516Z Modifying local plan and lookup in tandem. This is useful when fine control of how data is persisted 2025-12-04T09:17:30.0390883Z 2025-12-04T09:17:30.0391027Z >>> # xdoctest: +SKIP("undefined vars") 2025-12-04T09:17:30.0391270Z >>> class FP16Planner(DefaultSavePlanner): 2025-12-04T09:17:30.0391504Z >>> def create_local_plan(self): 2025-12-04T09:17:30.0391735Z >>> plan = super().create_local_plan() 2025-12-04T09:17:30.0392007Z >>> for p in plan: 2025-12-04T09:17:30.0392227Z >>> if p.tensor_data is not None: 2025-12-04T09:17:30.0392495Z >>> p.tensor_data.properties.dtype = torch.float16 2025-12-04T09:17:30.0392739Z >>> return plan 2025-12-04T09:17:30.0392908Z >>> 2025-12-04T09:17:30.0393064Z >>> def resolve_data(self, write_item): 2025-12-04T09:17:30.0393297Z >>> item = super().resolve_data(write_item) 2025-12-04T09:17:30.0393645Z >>> return item if write_item.type == WriteItemType.BYTE_IO else item.to(torch.float16) 2025-12-04T09:17:30.0393972Z 2025-12-04T09:17:30.0394282Z Using the global planning step to make central decisions that can't be made individually by each rank 2025-12-04T09:17:30.0394631Z 2025-12-04T09:17:30.0394785Z >>> # xdoctest: +SKIP("undefined vars") 2025-12-04T09:17:30.0395007Z >>> from itertools import zip_longest 2025-12-04T09:17:30.0395224Z >>> from dataclasses import replace 2025-12-04T09:17:30.0395481Z >>> class DDPLoadBalancingPlanner(DefaultSavePlanner): 2025-12-04T09:17:30.0395852Z >>> # This uses the default local plan behavior of having all non-sharded writes in rank 0 2025-12-04T09:17:30.0396197Z >>> # This sample doesn't handle ShardedTensors 2025-12-04T09:17:30.0396450Z >>> def create_global_plan(self, all_plans): 2025-12-04T09:17:30.0396727Z >>> iters = [iter(all_plans[0].items)] * len(all_plans) 2025-12-04T09:17:30.0396981Z >>> items_per_rank = [ 2025-12-04T09:17:30.0397203Z >>> [item for item in items if item is not None] 2025-12-04T09:17:30.0397482Z >>> for items in zip(*zip_longest(*iters), strict=True) 2025-12-04T09:17:30.0397719Z >>> ] 2025-12-04T09:17:30.0397871Z >>> all_plans = [ 2025-12-04T09:17:30.0398073Z >>> replace(plan, items=items) 2025-12-04T09:17:30.0398354Z >>> for plan, items in zip(all_plans, items_per_rank, strict=True) 2025-12-04T09:17:30.0398613Z >>> ] 2025-12-04T09:17:30.0398803Z >>> return super().create_global_plan(all_plans) 2025-12-04T09:17:30.0399031Z 2025-12-04T09:17:30.0399283Z Finally, some planners need to save additional metadata in the checkpoint, this is 2025-12-04T09:17:30.0399719Z accomplished by having each rank contribute their data items in the local plan and 2025-12-04T09:17:30.0400069Z the global planner aggregate them: 2025-12-04T09:17:30.0400273Z 2025-12-04T09:17:30.0400421Z >>> # xdoctest: +SKIP("undefined vars") 2025-12-04T09:17:30.0400730Z >>> class SaveExtraDataPlanner(DefaultSavePlanner): 2025-12-04T09:17:30.0401008Z >>> def create_local_plan(self) -> SavePlan: 2025-12-04T09:17:30.0401247Z >>> plan = super().create_local_plan() 2025-12-04T09:17:30.0401551Z >>> return replace(plan, planner_data="per-rank-data") 2025-12-04T09:17:30.0401794Z >>> 2025-12-04T09:17:30.0402072Z >>> def create_global_plan(self, all_plans: List[SavePlan]) -> Tuple[List[SavePlan], Metadata]: 2025-12-04T09:17:30.0402484Z >>> global_plan, metadata = super().create_global_plan(all_plans) 2025-12-04T09:17:30.0402941Z >>> merged_data = [p.planner_data for p in global_plan] 2025-12-04T09:17:30.0403267Z >>> metadata = replace(metadata, planner_data=merged_data) 2025-12-04T09:17:30.0403534Z >>> return global_plan, metadata 2025-12-04T09:17:30.0403784Z 2025-12-04T09:17:30.0404217Z Original Error: IndentationError('expected an indented block after function definition on line 3', ('', 9, 0, '_._ = None\n', 9, -1)) 2025-12-04T09:17:30.0404689Z 2025-12-04T09:17:30.0404819Z _._ = None 2025-12-04T09:17:30.0404965Z ^ 2025-12-04T09:17:30.0405103Z warnings.warn(msg) 2025-12-04T09:17:30.0405279Z 2025-12-04T09:17:30.0405495Z --- Parse Warning: 14 / 17 --- 2025-12-04T09:17:30.0406346Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/xdoctest/core.py:416: UserWarning: Cannot scrape callname=LoadPlanner in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/planner.py line=305. 2025-12-04T09:17:30.0407211Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T09:17:30.0407506Z 2025-12-04T09:17:30.0407770Z Abstract class defining the protocol used by load_state_dict to plan the load process. 2025-12-04T09:17:30.0408102Z 2025-12-04T09:17:30.0408365Z LoadPlanner are stateful objects that can be used to customize the whole load process. 2025-12-04T09:17:30.0408686Z 2025-12-04T09:17:30.0408948Z LoadPlanner acts as an access proxy to the state_dict, so any transformation done to it 2025-12-04T09:17:30.0409281Z will be visible to the whole process. 2025-12-04T09:17:30.0409485Z 2025-12-04T09:17:30.0409740Z A planner subclass can expect the following sequence of calls during load_state_dict: 2025-12-04T09:17:30.0410050Z 2025-12-04T09:17:30.0410206Z 1) set_up_planner - called on all ranks. 2025-12-04T09:17:30.0410450Z Signals the start of loading a checkpoint. 2025-12-04T09:17:30.0410662Z 2025-12-04T09:17:30.0410820Z 2) create_local_plan - called on all ranks. 2025-12-04T09:17:30.0411171Z Process the state_dict and produces a `LoadPlan` that will be sent for global planning. 2025-12-04T09:17:30.0411493Z 2025-12-04T09:17:30.0411684Z 3) create_global_plan - called on the coordinator rank only. 2025-12-04T09:17:30.0412013Z Takes the LoadPlan from all ranks and make any global decision. 2025-12-04T09:17:30.0412282Z 2025-12-04T09:17:30.0412506Z 4) load_bytes - called multiple times on each rank 2025-12-04T09:17:30.0412809Z This is called once per non-tensor value in state_dict. 2025-12-04T09:17:30.0412864Z 2025-12-04T09:17:30.0413016Z 5) resolve_tensor and commit_tensor - called multiple times on each rank 2025-12-04T09:17:30.0413140Z They are called in pair for each Tensor value in state_dict. 2025-12-04T09:17:30.0413193Z 2025-12-04T09:17:30.0413401Z Users are recommended to extend DefaultLoadPlanner instead of this interface directly as 2025-12-04T09:17:30.0413524Z most changes can be expressed by changes in a single method. 2025-12-04T09:17:30.0413583Z 2025-12-04T09:17:30.0413670Z There are two usual patterns of extension: 2025-12-04T09:17:30.0413723Z 2025-12-04T09:17:30.0413904Z Rewriting state_dict. This is the simplest way to extend the load process as it 2025-12-04T09:17:30.0414122Z doesn't requite understanding the intrincacies of how LoadPlan works. We need 2025-12-04T09:17:30.0414278Z to keep a reference to the original state_dict as load happens in place so 2025-12-04T09:17:30.0414369Z we need to be able to perform it in place 2025-12-04T09:17:30.0414456Z 2025-12-04T09:17:30.0414534Z >>> # xdoctest: +SKIP("undefined vars") 2025-12-04T09:17:30.0414643Z >>> class RenamePlanner(DefaultLoadPlanner): 2025-12-04T09:17:30.0414714Z >>> def set_up_planner( 2025-12-04T09:17:30.0414776Z >>> self, 2025-12-04T09:17:30.0414851Z >>> state_dict: STATE_DICT_TYPE, 2025-12-04T09:17:30.0414918Z >>> metadata: Metadata, 2025-12-04T09:17:30.0414996Z >>> is_coordinator: bool, 2025-12-04T09:17:30.0415055Z >>> ) -> None: 2025-12-04T09:17:30.0415144Z >>> self.original_state_dict = state_dict 2025-12-04T09:17:30.0415316Z >>> state_dict = {"foo_" + k: v for k, v in state_dict.items()} 2025-12-04T09:17:30.0415372Z >>> 2025-12-04T09:17:30.0415456Z >>> if self.flatten_sharded_tensors: 2025-12-04T09:17:30.0415568Z >>> state_dict = _flatten_sharded_tensors(state_dict) 2025-12-04T09:17:30.0415622Z >>> 2025-12-04T09:17:30.0415706Z >>> if self.flatten_state_dict: 2025-12-04T09:17:30.0415829Z >>> state_dict, self.mappings = flatten_state_dict(state_dict) 2025-12-04T09:17:30.0415923Z >>> 2025-12-04T09:17:30.0416004Z >>> self.state_dict = state_dict 2025-12-04T09:17:30.0416078Z >>> self.metadata = metadata 2025-12-04T09:17:30.0416162Z >>> self.is_coordinator = is_coordinator 2025-12-04T09:17:30.0416225Z >>> 2025-12-04T09:17:30.0416310Z >>> def load_bytes(self, read_item, value): 2025-12-04T09:17:30.0416382Z >>> # Remove the "foo_" prefix 2025-12-04T09:17:30.0416603Z >>> self.original_state_dict[read_item.dest_index.fqn[4:]] = torch.load(value, weights_only=False) 2025-12-04T09:17:30.0416658Z 2025-12-04T09:17:30.0416712Z 2025-12-04T09:17:30.0416897Z Modifying resolve_tensor and commit_tensor to handle load time transformation. 2025-12-04T09:17:30.0416950Z 2025-12-04T09:17:30.0417034Z >>> # xdoctest: +SKIP("undefined vars") 2025-12-04T09:17:30.0417146Z >>> class MetaModelMaterialize(DefaultSavePlanner): 2025-12-04T09:17:30.0417227Z >>> def resolve_tensor(self, read_item): 2025-12-04T09:17:30.0417335Z >>> tensor = super().resolve_tensor(read_item) 2025-12-04T09:17:30.0417443Z >>> return torch.empty_like(tensor, device="cpu") 2025-12-04T09:17:30.0417498Z >>> 2025-12-04T09:17:30.0417591Z >>> def commit_tensor(self, read_item, tensor): 2025-12-04T09:17:30.0417699Z >>> self.state_dict[read_item.dest_index.fqn] = tensor 2025-12-04T09:17:30.0417752Z 2025-12-04T09:17:30.0418117Z Original Error: IndentationError('expected an indented block after function definition on line 22', ('', 23, 0, '_._ = None\n', 23, -1)) 2025-12-04T09:17:30.0418173Z 2025-12-04T09:17:30.0418234Z _._ = None 2025-12-04T09:17:30.0418288Z ^ 2025-12-04T09:17:30.0418354Z warnings.warn(msg) 2025-12-04T09:17:30.0418410Z 2025-12-04T09:17:30.0418547Z --- Parse Warning: 15 / 17 --- 2025-12-04T09:17:30.0419181Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/xdoctest/core.py:416: UserWarning: Cannot scrape callname=FullStateDictConfig in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/api.py line=295. 2025-12-04T09:17:30.0419352Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T09:17:30.0419404Z 2025-12-04T09:17:30.0419547Z ``FullStateDictConfig`` is a config class meant to be used with 2025-12-04T09:17:30.0419676Z ``StateDictType.FULL_STATE_DICT``. We recommend enabling both 2025-12-04T09:17:30.0419815Z ``offload_to_cpu=True`` and ``rank0_only=True`` when saving full state 2025-12-04T09:17:30.0420033Z dicts to save GPU memory and CPU memory, respectively. This config class 2025-12-04T09:17:30.0420167Z is meant to be used via the :func:`state_dict_type` context manager as 2025-12-04T09:17:30.0420226Z follows: 2025-12-04T09:17:30.0420317Z 2025-12-04T09:17:30.0420404Z >>> # xdoctest: +SKIP("undefined variables") 2025-12-04T09:17:30.0420564Z >>> from torch.distributed.fsdp import FullyShardedDataParallel as FSDP 2025-12-04T09:17:30.0420653Z >>> fsdp = FSDP(model, auto_wrap_policy=...) 2025-12-04T09:17:30.0420789Z >>> cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True) 2025-12-04T09:17:30.0420944Z >>> with FSDP.state_dict_type(fsdp, StateDictType.FULL_STATE_DICT, cfg): 2025-12-04T09:17:30.0421021Z >>> state = fsdp.state_dict() 2025-12-04T09:17:30.0421164Z >>> # `state` will be empty on non rank 0 and contain CPU tensors on rank 0. 2025-12-04T09:17:30.0421381Z >>> # To reload checkpoint for inference, finetuning, transfer learning, etc: 2025-12-04T09:17:30.0421539Z >>> model = model_fn() # Initialize model in preparation for wrapping with FSDP 2025-12-04T09:17:30.0421607Z >>> if dist.get_rank() == 0: 2025-12-04T09:17:30.0421736Z >>> # Load checkpoint only on rank 0 to avoid memory redundancy 2025-12-04T09:17:30.0421834Z >>> state_dict = torch.load("my_checkpoint.pt") 2025-12-04T09:17:30.0421960Z >>> model.load_state_dict(state_dict) 2025-12-04T09:17:30.0422130Z >>> # All ranks initialize FSDP module as usual. `sync_module_states` argument 2025-12-04T09:17:30.0422288Z >>> # communicates loaded checkpoint states from rank 0 to rest of the world. 2025-12-04T09:17:30.0422360Z >>> fsdp = FSDP( 2025-12-04T09:17:30.0422419Z ... model, 2025-12-04T09:17:30.0422510Z ... device_id=torch.cuda.current_device(), 2025-12-04T09:17:30.0422588Z ... auto_wrap_policy=..., 2025-12-04T09:17:30.0422660Z ... sync_module_states=True, 2025-12-04T09:17:30.0422721Z ... ) 2025-12-04T09:17:30.0422861Z >>> # After this point, all ranks have FSDP model with loaded checkpoint. 2025-12-04T09:17:30.0422914Z 2025-12-04T09:17:30.0422977Z Attributes: 2025-12-04T09:17:30.0423112Z rank0_only (bool): If ``True``, then only rank 0 saves the full state 2025-12-04T09:17:30.0423250Z dict, and nonzero ranks save an empty dict. If ``False``, then all 2025-12-04T09:17:30.0423363Z ranks save the full state dict. (Default: ``False``) 2025-12-04T09:17:30.0423416Z 2025-12-04T09:17:30.0423747Z Original Error: IndentationError("expected an indented block after 'if' statement on line 10", ('', 11, 1, '_._ = None\n', 11, 2)) 2025-12-04T09:17:30.0423805Z 2025-12-04T09:17:30.0423860Z _._ = None 2025-12-04T09:17:30.0423919Z ^ 2025-12-04T09:17:30.0423987Z warnings.warn(msg) 2025-12-04T09:17:30.0424039Z 2025-12-04T09:17:30.0424172Z --- Parse Warning: 16 / 17 --- 2025-12-04T09:17:30.0424818Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/xdoctest/core.py:416: UserWarning: Cannot scrape callname=register_parametrization in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/parametrize.py line=437. 2025-12-04T09:17:30.0424983Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T09:17:30.0425099Z Register a parametrization to a tensor in a module. 2025-12-04T09:17:30.0425153Z 2025-12-04T09:17:30.0425344Z Assume that ``tensor_name="weight"`` for simplicity. When accessing ``module.weight``, 2025-12-04T09:17:30.0425530Z the module will return the parametrized version ``parametrization(module.weight)``. 2025-12-04T09:17:30.0425705Z If the original tensor requires a gradient, the backward pass will differentiate 2025-12-04T09:17:30.0425904Z through :attr:`parametrization`, and the optimizer will update the tensor accordingly. 2025-12-04T09:17:30.0425955Z 2025-12-04T09:17:30.0426210Z The first time that a module registers a parametrization, this function will add an attribute 2025-12-04T09:17:30.0426381Z ``parametrizations`` to the module of type :class:`~ParametrizationList`. 2025-12-04T09:17:30.0426468Z 2025-12-04T09:17:30.0426643Z The list of parametrizations on the tensor ``weight`` will be accessible under 2025-12-04T09:17:30.0426735Z ``module.parametrizations.weight``. 2025-12-04T09:17:30.0426788Z 2025-12-04T09:17:30.0426885Z The original tensor will be accessible under 2025-12-04T09:17:30.0426987Z ``module.parametrizations.weight.original``. 2025-12-04T09:17:30.0427039Z 2025-12-04T09:17:30.0427223Z Parametrizations may be concatenated by registering several parametrizations 2025-12-04T09:17:30.0427292Z on the same attribute. 2025-12-04T09:17:30.0427380Z 2025-12-04T09:17:30.0427547Z The training mode of a registered parametrization is updated on registration 2025-12-04T09:17:30.0427639Z to match the training mode of the host module 2025-12-04T09:17:30.0427699Z 2025-12-04T09:17:30.0427903Z Parametrized parameters and buffers have an inbuilt caching system that can be activated 2025-12-04T09:17:30.0427993Z using the context manager :func:`cached`. 2025-12-04T09:17:30.0428066Z 2025-12-04T09:17:30.0428269Z A :attr:`parametrization` may optionally implement a method with signature 2025-12-04T09:17:30.0428324Z 2025-12-04T09:17:30.0428403Z .. code-block:: python 2025-12-04T09:17:30.0428456Z 2025-12-04T09:17:30.0428601Z def right_inverse(self, X: Tensor) -> Union[Tensor, Sequence[Tensor]] 2025-12-04T09:17:30.0428661Z 2025-12-04T09:17:30.0428838Z This method is called on the unparametrized tensor when the first parametrization 2025-12-04T09:17:30.0428988Z is registered to compute the initial value of the original tensor. 2025-12-04T09:17:30.0429195Z If this method is not implemented, the original tensor will be just the unparametrized tensor. 2025-12-04T09:17:30.0429250Z 2025-12-04T09:17:30.0429463Z If all the parametrizations registered on a tensor implement `right_inverse` it is possible 2025-12-04T09:17:30.0429656Z to initialize a parametrized tensor by assigning to it, as shown in the example below. 2025-12-04T09:17:30.0429709Z 2025-12-04T09:17:30.0429867Z It is possible for the first parametrization to depend on several inputs. 2025-12-04T09:17:30.0430025Z This may be implemented returning a tuple of tensors from ``right_inverse`` 2025-12-04T09:17:30.0430186Z (see the example implementation of a ``RankOne`` parametrization below). 2025-12-04T09:17:30.0430240Z 2025-12-04T09:17:30.0430466Z In this case, the unconstrained tensors are also located under ``module.parametrizations.weight`` 2025-12-04T09:17:30.0430560Z with names ``original0``, ``original1``,... 2025-12-04T09:17:30.0430613Z 2025-12-04T09:17:30.0430670Z .. note:: 2025-12-04T09:17:30.0430733Z 2025-12-04T09:17:30.0430912Z If unsafe=False (default) both the forward and right_inverse methods will be called 2025-12-04T09:17:30.0431012Z once to perform a number of consistency checks. 2025-12-04T09:17:30.0431196Z If unsafe=True, then right_inverse will be called if the tensor is not parametrized, 2025-12-04T09:17:30.0431277Z and nothing will be called otherwise. 2025-12-04T09:17:30.0431333Z 2025-12-04T09:17:30.0431391Z .. note:: 2025-12-04T09:17:30.0431442Z 2025-12-04T09:17:30.0431583Z In most situations, ``right_inverse`` will be a function such that 2025-12-04T09:17:30.0431673Z ``forward(right_inverse(X)) == X`` (see 2025-12-04T09:17:30.0431880Z `right inverse `_). 2025-12-04T09:17:30.0432061Z Sometimes, when the parametrization is not surjective, it may be reasonable 2025-12-04T09:17:30.0432172Z to relax this. 2025-12-04T09:17:30.0432229Z 2025-12-04T09:17:30.0432294Z .. warning:: 2025-12-04T09:17:30.0432348Z 2025-12-04T09:17:30.0432539Z If a parametrization depends on several inputs, :func:`~register_parametrization` 2025-12-04T09:17:30.0432748Z will register a number of new parameters. If such parametrization is registered 2025-12-04T09:17:30.0432931Z after the optimizer is created, these new parameters will need to be added manually 2025-12-04T09:17:30.0433069Z to the optimizer. See :meth:`torch.Optimizer.add_param_group`. 2025-12-04T09:17:30.0433123Z 2025-12-04T09:17:30.0433178Z Args: 2025-12-04T09:17:30.0433325Z module (nn.Module): module on which to register the parametrization 2025-12-04T09:17:30.0433469Z tensor_name (str): name of the parameter or buffer on which to register 2025-12-04T09:17:30.0433580Z the parametrization 2025-12-04T09:17:30.0433720Z parametrization (nn.Module): the parametrization to register 2025-12-04T09:17:30.0433781Z Keyword args: 2025-12-04T09:17:30.0433931Z unsafe (bool): a boolean flag that denotes whether the parametrization 2025-12-04T09:17:30.0434061Z may change the dtype and shape of the tensor. Default: `False` 2025-12-04T09:17:30.0434277Z Warning: the parametrization is not checked for consistency upon registration. 2025-12-04T09:17:30.0434364Z Enable this flag at your own risk. 2025-12-04T09:17:30.0434417Z 2025-12-04T09:17:30.0434474Z Raises: 2025-12-04T09:17:30.0434671Z ValueError: if the module does not have a parameter or a buffer named :attr:`tensor_name` 2025-12-04T09:17:30.0434723Z 2025-12-04T09:17:30.0434788Z Examples: 2025-12-04T09:17:30.0434891Z >>> # xdoctest: +REQUIRES(env:TORCH_DOCTEST_LAPACK) 2025-12-04T09:17:30.0434957Z >>> import torch 2025-12-04T09:17:30.0435043Z >>> import torch.nn as nn 2025-12-04T09:17:30.0435144Z >>> import torch.nn.utils.parametrize as P 2025-12-04T09:17:30.0435202Z >>> 2025-12-04T09:17:30.0435286Z >>> class Symmetric(nn.Module): 2025-12-04T09:17:30.0435358Z >>> def forward(self, X): 2025-12-04T09:17:30.0435481Z >>> return X.triu() + X.triu(1).T # Return a symmetric matrix 2025-12-04T09:17:30.0435541Z >>> 2025-12-04T09:17:30.0435620Z >>> def right_inverse(self, A): 2025-12-04T09:17:30.0435695Z >>> return A.triu() 2025-12-04T09:17:30.0435749Z >>> 2025-12-04T09:17:30.0435817Z >>> m = nn.Linear(5, 5) 2025-12-04T09:17:30.0435943Z >>> P.register_parametrization(m, "weight", Symmetric()) 2025-12-04T09:17:30.0436107Z >>> print(torch.allclose(m.weight, m.weight.T)) # m.weight is now symmetric 2025-12-04T09:17:30.0436163Z True 2025-12-04T09:17:30.0436239Z >>> A = torch.rand(5, 5) 2025-12-04T09:17:30.0436315Z >>> A = A + A.T # A is now symmetric 2025-12-04T09:17:30.0436448Z >>> m.weight = A # Initialize the weight to be the symmetric matrix A 2025-12-04T09:17:30.0436543Z >>> print(torch.allclose(m.weight, A)) 2025-12-04T09:17:30.0436598Z True 2025-12-04T09:17:30.0436658Z 2025-12-04T09:17:30.0436731Z >>> class RankOne(nn.Module): 2025-12-04T09:17:30.0436811Z >>> def forward(self, x, y): 2025-12-04T09:17:30.0436916Z >>> # Form a rank 1 matrix multiplying two vectors 2025-12-04T09:17:30.0437008Z >>> return x.unsqueeze(-1) @ y.unsqueeze(-2) 2025-12-04T09:17:30.0437063Z >>> 2025-12-04T09:17:30.0437147Z >>> def right_inverse(self, Z): 2025-12-04T09:17:30.0437232Z >>> # Project Z onto the rank 1 matrices 2025-12-04T09:17:30.0437338Z >>> U, S, Vh = torch.linalg.svd(Z, full_matrices=False) 2025-12-04T09:17:30.0437472Z >>> # Return rescaled singular vectors 2025-12-04T09:17:30.0437559Z >>> s0_sqrt = S[0].sqrt().unsqueeze(-1) 2025-12-04T09:17:30.0437665Z >>> return U[..., :, 0] * s0_sqrt, Vh[..., 0, :] * s0_sqrt 2025-12-04T09:17:30.0437807Z >>> 2025-12-04T09:17:30.0437911Z >>> linear_rank_one = P.register_parametrization( 2025-12-04T09:17:30.0438012Z ... nn.Linear(4, 4), "weight", RankOne() 2025-12-04T09:17:30.0438074Z ... ) 2025-12-04T09:17:30.0438214Z >>> print(torch.linalg.matrix_rank(linear_rank_one.weight).item()) 2025-12-04T09:17:30.0438275Z 1 2025-12-04T09:17:30.0438327Z 2025-12-04T09:17:30.0438380Z 2025-12-04T09:17:30.0438742Z Original Error: IndentationError('expected an indented block after function definition on line 2', ('', 3, 0, '_._ = None\n', 3, -1)) 2025-12-04T09:17:30.0438834Z 2025-12-04T09:17:30.0438899Z _._ = None 2025-12-04T09:17:30.0438955Z ^ 2025-12-04T09:17:30.0439024Z warnings.warn(msg) 2025-12-04T09:17:30.0439083Z 2025-12-04T09:17:30.0439215Z --- Parse Warning: 17 / 17 --- 2025-12-04T09:17:30.0439874Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/xdoctest/core.py:416: UserWarning: Cannot scrape callname=ReduceLROnPlateau in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py line=1586. 2025-12-04T09:17:30.0440058Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T09:17:30.0440180Z Reduce learning rate when a metric has stopped improving. 2025-12-04T09:17:30.0440232Z 2025-12-04T09:17:30.0440374Z Models often benefit from reducing the learning rate by a factor 2025-12-04T09:17:30.0440499Z of 2-10 once learning stagnates. This scheduler reads a metrics 2025-12-04T09:17:30.0440636Z quantity and if no improvement is seen for a 'patience' number 2025-12-04T09:17:30.0440724Z of epochs, the learning rate is reduced. 2025-12-04T09:17:30.0440774Z 2025-12-04T09:17:30.0440834Z Args: 2025-12-04T09:17:30.0440924Z optimizer (Optimizer): Wrapped optimizer. 2025-12-04T09:17:30.0441031Z mode (str): One of `min`, `max`. In `min` mode, lr will 2025-12-04T09:17:30.0441142Z be reduced when the quantity monitored has stopped 2025-12-04T09:17:30.0441258Z decreasing; in `max` mode it will be reduced when the 2025-12-04T09:17:30.0441404Z quantity monitored has stopped increasing. Default: 'min'. 2025-12-04T09:17:30.0441520Z factor (float): Factor by which the learning rate will be 2025-12-04T09:17:30.0441615Z reduced. new_lr = lr * factor. Default: 0.1. 2025-12-04T09:17:30.0441765Z patience (int): The number of allowed epochs with no improvement after 2025-12-04T09:17:30.0441854Z which the learning rate will be reduced. 2025-12-04T09:17:30.0442009Z For example, consider the case of having no patience (`patience = 0`). 2025-12-04T09:17:30.0442257Z In the first epoch, a baseline is established and is always considered good as there's no previous baseline. 2025-12-04T09:17:30.0442401Z In the second epoch, if the performance is worse than the baseline, 2025-12-04T09:17:30.0442508Z we have what is considered an intolerable epoch. 2025-12-04T09:17:30.0442687Z Since the count of intolerable epochs (1) is greater than the patience level (0), 2025-12-04T09:17:30.0442800Z the learning rate is reduced at the end of this epoch. 2025-12-04T09:17:30.0443019Z From the third epoch onwards, the learning rate continues to be reduced at the end of each epoch 2025-12-04T09:17:30.0443226Z if the performance is worse than the baseline. If the performance improves or remains the same, 2025-12-04T09:17:30.0443315Z the learning rate is not adjusted. 2025-12-04T09:17:30.0443426Z Default: 10. 2025-12-04T09:17:30.0443556Z threshold (float): Threshold for measuring the new optimum, 2025-12-04T09:17:30.0443668Z to only focus on significant changes. Default: 1e-4. 2025-12-04T09:17:30.0443834Z threshold_mode (str): One of `rel`, `abs`. In `rel` mode, 2025-12-04T09:17:30.0443940Z dynamic_threshold = best * ( 1 + threshold ) in 'max' 2025-12-04T09:17:30.0444042Z mode or best * ( 1 - threshold ) in `min` mode. 2025-12-04T09:17:30.0444143Z In `abs` mode, dynamic_threshold = best + threshold in 2025-12-04T09:17:30.0444272Z `max` mode or best - threshold in `min` mode. Default: 'rel'. 2025-12-04T09:17:30.0444394Z cooldown (int): Number of epochs to wait before resuming 2025-12-04T09:17:30.0444548Z normal operation after lr has been reduced. Default: 0. 2025-12-04T09:17:30.0444662Z min_lr (float or list): A scalar or a list of scalars. A 2025-12-04T09:17:30.0444766Z lower bound on the learning rate of all param groups 2025-12-04T09:17:30.0444856Z or each group respectively. Default: 0. 2025-12-04T09:17:30.0444982Z eps (float): Minimal decay applied to lr. If the difference 2025-12-04T09:17:30.0445140Z between new and old lr is smaller than eps, the update is 2025-12-04T09:17:30.0445214Z ignored. Default: 1e-8. 2025-12-04T09:17:30.0445271Z 2025-12-04T09:17:30.0445328Z Example: 2025-12-04T09:17:30.0445402Z >>> # xdoctest: +SKIP 2025-12-04T09:17:30.0445557Z >>> optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9) 2025-12-04T09:17:30.0445674Z >>> scheduler = ReduceLROnPlateau(optimizer, "min") 2025-12-04T09:17:30.0445755Z >>> for epoch in range(10): 2025-12-04T09:17:30.0445821Z >>> train(...) 2025-12-04T09:17:30.0445900Z >>> val_loss = validate(...) 2025-12-04T09:17:30.0446008Z >>> # Note that step should be called after validate() 2025-12-04T09:17:30.0446084Z >>> scheduler.step(val_loss) 2025-12-04T09:17:30.0446138Z 2025-12-04T09:17:30.0446277Z .. image:: ../scripts/lr_scheduler_images/ReduceLROnPlateau.png 2025-12-04T09:17:30.0446331Z 2025-12-04T09:17:30.0446622Z Original Error: IndentationError('unexpected indent', ('', 8, 4, ' scheduler.step(val_loss)\n', 8, -1)) 2025-12-04T09:17:30.0446675Z 2025-12-04T09:17:30.0446750Z scheduler.step(val_loss) 2025-12-04T09:17:30.0446810Z ^ 2025-12-04T09:17:30.0446876Z warnings.warn(msg) 2025-12-04T09:17:30.0446930Z 2025-12-04T09:17:30.0447025Z  2025-12-04T09:17:30.0447149Z === Found 10 run-time warnings === 2025-12-04T09:17:30.0447273Z --- Runtime Warning: 1 / 10 --- 2025-12-04T09:17:30.0447414Z example = 2025-12-04T09:17:30.0448387Z :3: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:17:30.0448452Z 2025-12-04T09:17:30.0448574Z --- Runtime Warning: 2 / 10 --- 2025-12-04T09:17:30.0448749Z example = 2025-12-04T09:17:30.0449640Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_tensor.py:1392: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at /var/lib/jenkins/workspace/c10/core/TensorImpl.h:1973.) 2025-12-04T09:17:30.0449766Z return super().refine_names(names) 2025-12-04T09:17:30.0449829Z 2025-12-04T09:17:30.0449950Z --- Runtime Warning: 3 / 10 --- 2025-12-04T09:17:30.0450194Z example = 2025-12-04T09:17:30.0450631Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/library.py:275: UserWarning: Warning only once for all operators, other operators may also be overridden. 2025-12-04T09:17:30.0450849Z Overriding a previously registered kernel for the same operator and the same dispatch key 2025-12-04T09:17:30.0450989Z operator: aten::div.Tensor(Tensor self, Tensor other) -> Tensor 2025-12-04T09:17:30.0451192Z registered at /var/lib/jenkins/workspace/build/aten/src/ATen/RegisterSchema.cpp:6 2025-12-04T09:17:30.0451295Z dispatch key: CPU 2025-12-04T09:17:30.0451586Z previous kernel: registered at /var/lib/jenkins/workspace/aten/src/ATen/LegacyBatchingRegistrations.cpp:1079 2025-12-04T09:17:30.0452249Z new kernel: registered at :1 (Triggered internally at /var/lib/jenkins/workspace/aten/src/ATen/core/dispatch/OperatorEntry.cpp:208.) 2025-12-04T09:17:30.0452415Z impl_fn(self.ns, name.split("::")[-1], dispatch_key) 2025-12-04T09:17:30.0452517Z 2025-12-04T09:17:30.0452648Z --- Runtime Warning: 4 / 10 --- 2025-12-04T09:17:30.0452813Z example = 2025-12-04T09:17:30.0454039Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nested/__init__.py:117: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. We recommend specifying layout=torch.jagged when constructing a nested tensor, as this layout receives active development, has better operator coverage, and works with torch.compile. (Triggered internally at /var/lib/jenkins/workspace/aten/src/ATen/NestedTensorImpl.cpp:178.) 2025-12-04T09:17:30.0454227Z return torch._nested_tensor_from_tensor_list(ts, dtype, None, device, None) 2025-12-04T09:17:30.0454284Z 2025-12-04T09:17:30.0454408Z --- Runtime Warning: 5 / 10 --- 2025-12-04T09:17:30.0454590Z example = 2025-12-04T09:17:30.0455672Z :1: UserWarning: Sparse CSR tensor support is in beta state. If you miss a functionality in the sparse tensor support, please submit a feature request to https://github.com/pytorch/pytorch/issues. (Triggered internally at /var/lib/jenkins/workspace/aten/src/ATen/SparseCsrTensorImpl.cpp:53.) 2025-12-04T09:17:30.0455737Z 2025-12-04T09:17:30.0455864Z --- Runtime Warning: 6 / 10 --- 2025-12-04T09:17:30.0456082Z example = 2025-12-04T09:17:30.0457076Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/const_fold.py:314: UserWarning: Attempted to insert a get_attr Node with no underlying reference in the owning GraphModule! Call GraphModule.add_submodule to add the necessary submodule, GraphModule.add_parameter to add the necessary Parameter, or nn.Module.register_buffer to add the necessary buffer 2025-12-04T09:17:30.0457201Z new_node = root_const_gm.graph.get_attr(in_node.target) 2025-12-04T09:17:30.0457256Z 2025-12-04T09:17:30.0457376Z --- Runtime Warning: 7 / 10 --- 2025-12-04T09:17:30.0457583Z example = 2025-12-04T09:17:30.0458348Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/transformer.py:144: UserWarning: enable_nested_tensor is True, but self.use_nested_tensor is False because encoder_layer.self_attn.batch_first was not True(use batch_first for better inference performance) 2025-12-04T09:17:30.0458442Z self.encoder = TransformerEncoder( 2025-12-04T09:17:30.0458501Z 2025-12-04T09:17:30.0458624Z --- Runtime Warning: 8 / 10 --- 2025-12-04T09:17:30.0458889Z example = 2025-12-04T09:17:30.0459692Z :2: UserWarning: enable_nested_tensor is True, but self.use_nested_tensor is False because encoder_layer.self_attn.batch_first was not True(use batch_first for better inference performance) 2025-12-04T09:17:30.0459745Z 2025-12-04T09:17:30.0459872Z --- Runtime Warning: 9 / 10 --- 2025-12-04T09:17:30.0460092Z example = 2025-12-04T09:17:30.0460639Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/weight_norm.py:144: FutureWarning: `torch.nn.utils.weight_norm` is deprecated in favor of `torch.nn.utils.parametrizations.weight_norm`. 2025-12-04T09:17:30.0460720Z WeightNorm.apply(module, name, dim) 2025-12-04T09:17:30.0460783Z 2025-12-04T09:17:30.0460916Z --- Runtime Warning: 10 / 10 --- 2025-12-04T09:17:30.0461162Z example = 2025-12-04T09:17:30.0461700Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/weight_norm.py:144: FutureWarning: `torch.nn.utils.weight_norm` is deprecated in favor of `torch.nn.utils.parametrizations.weight_norm`. 2025-12-04T09:17:30.0461961Z WeightNorm.apply(module, name, dim) 2025-12-04T09:17:30.0462019Z 2025-12-04T09:17:30.0462251Z === 378 passed, 516 skipped, 27 warnings in 15.88 seconds === 2025-12-04T09:17:30.0462434Z Finished doctests 1/1 ... [2025-12-04 09:17:30.015482][1493.679293644], took 0.27min 2025-12-04T09:17:30.0462624Z Running inductor/test_cutlass_backend 1/1 ... [2025-12-04 09:17:30.017735][1493.681554444] 2025-12-04T09:17:30.0462708Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:17:30.0463441Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_cutlass_backend.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:17:30.018093] 2025-12-04T09:17:35.3341162Z 2025-12-04T09:17:35.3342281Z inductor/test_cutlass_backend 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_cutlass_backend_1.1_b129286a599ec374_.log 2025-12-04T09:17:35.3342981Z 2025-12-04T09:17:35.3343277Z Finished inductor/test_cutlass_backend 1/1 ... [2025-12-04 09:17:35.334048][1498.997864336], took 0.09min 2025-12-04T09:17:35.3363334Z Running inductor/test_benchmark_fusion 1/1 ... [2025-12-04 09:17:35.336105][1498.999925963] 2025-12-04T09:17:35.3363790Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:17:35.3367067Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_benchmark_fusion.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:17:35.336458] 2025-12-04T09:17:41.5267356Z 2025-12-04T09:17:41.5268591Z inductor/test_benchmark_fusion 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_benchmark_fusion_1.1_21e3a2205906c0e4_.log 2025-12-04T09:17:41.5305150Z Running 100 items in this shard: test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code 2025-12-04T09:17:41.5334966Z 2025-12-04T09:17:41.5335205Z Finished inductor/test_benchmark_fusion 1/1 ... [2025-12-04 09:17:41.526573][1505.190390979], took 0.10min 2025-12-04T09:17:41.5336009Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_benchmark_fusion/inductor.test_benchmark_fusion-e609d0354be0d1fa.xml 2025-12-04T09:17:41.6114346Z Running inductor/test_distributed_patterns 1/1 ... [2025-12-04 09:17:41.611187][1505.275007489] 2025-12-04T09:17:41.6115067Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:17:41.6118068Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_distributed_patterns.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:17:41.611495] 2025-12-04T09:17:47.0179530Z 2025-12-04T09:17:47.0180652Z inductor/test_distributed_patterns 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_distributed_patterns_1.1_37bc78cbf755a342_.log 2025-12-04T09:17:47.0181660Z Running 0 items in this shard: 2025-12-04T09:17:47.0181889Z 2025-12-04T09:17:47.0182271Z Finished inductor/test_distributed_patterns 1/1 ... [2025-12-04 09:17:47.017733][1510.681550819], took 0.09min 2025-12-04T09:17:47.0201334Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_distributed_patterns/inductor.test_distributed_patterns-ec95bb2c31d278a1.xml 2025-12-04T09:17:47.0444548Z Running test_sort_and_select 1/1 ... [2025-12-04 09:17:47.044248][1510.708069049] 2025-12-04T09:17:47.0445084Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:17:47.0448203Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_sort_and_select.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:17:47.044538] 2025-12-04T09:17:50.4033628Z 2025-12-04T09:17:50.4034556Z test_sort_and_select 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_sort_and_select_1.1_be558fc611a376c0_.log 2025-12-04T09:17:50.4035420Z Running 0 items in this shard: 2025-12-04T09:17:50.4035888Z 2025-12-04T09:17:50.4036206Z Finished test_sort_and_select 1/1 ... [2025-12-04 09:17:50.403102][1514.066919587], took 0.06min 2025-12-04T09:17:50.4056735Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_sort_and_select/test_sort_and_select-7d6d7bedd825d12a.xml 2025-12-04T09:17:50.4322892Z Running test_privateuseone_python_backend 1/1 ... [2025-12-04 09:17:50.432044][1514.095864671] 2025-12-04T09:17:50.4323488Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:17:50.4326268Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_privateuseone_python_backend.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:17:50.432333] 2025-12-04T09:17:53.2082482Z 2025-12-04T09:17:53.2083621Z test_privateuseone_python_backend 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_privateuseone_python_backend_1.1_67f192d2bb258afb_.log 2025-12-04T09:17:53.2084617Z Running 0 items in this shard: 2025-12-04T09:17:53.2084829Z 2025-12-04T09:17:53.2085204Z Finished test_privateuseone_python_backend 1/1 ... [2025-12-04 09:17:53.208019][1516.871837703], took 0.05min 2025-12-04T09:17:53.2107372Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_privateuseone_python_backend/test_privateuseone_python_backend-5b1698bc2f192c0d.xml 2025-12-04T09:17:53.2396437Z Running test_cpp_api_parity 1/1 ... [2025-12-04 09:17:53.239416][1516.90323706] 2025-12-04T09:17:53.2396939Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:17:53.2399938Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_cpp_api_parity.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:17:53.239705] 2025-12-04T09:18:01.3871828Z 2025-12-04T09:18:01.3872807Z test_cpp_api_parity 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_cpp_api_parity_1.1_e38423beb940dabc_.log 2025-12-04T09:18:01.3873749Z Running 0 items in this shard: 2025-12-04T09:18:01.3873989Z 2025-12-04T09:18:01.3874319Z Finished test_cpp_api_parity 1/1 ... [2025-12-04 09:18:01.386984][1525.050802064], took 0.14min 2025-12-04T09:18:01.3897623Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_cpp_api_parity/test_cpp_api_parity-410fc5b37a9a9af7.xml 2025-12-04T09:18:01.4686170Z Running test_extension_utils 1/1 ... [2025-12-04 09:18:01.468393][1525.132213316] 2025-12-04T09:18:01.4686622Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:18:01.4689416Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_extension_utils.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:18:01.468678] 2025-12-04T09:18:04.2471196Z 2025-12-04T09:18:04.2472138Z test_extension_utils 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_extension_utils_1.1_6f448e6667adee0f_.log 2025-12-04T09:18:04.2473013Z Running 0 items in this shard: 2025-12-04T09:18:04.2473230Z 2025-12-04T09:18:04.2473544Z Finished test_extension_utils 1/1 ... [2025-12-04 09:18:04.246892][1527.910709474], took 0.05min 2025-12-04T09:18:04.2498009Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_extension_utils/test_extension_utils-8ecaa1c21c0294f7.xml 2025-12-04T09:18:04.2756562Z Running test_show_pickle 1/1 ... [2025-12-04 09:18:04.275452][1527.939273029] 2025-12-04T09:18:04.2757085Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:18:04.2760361Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_show_pickle.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:18:04.275735] 2025-12-04T09:18:07.0515740Z 2025-12-04T09:18:07.0516613Z test_show_pickle 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_show_pickle_1.1_c585acf62ef1069a_.log 2025-12-04T09:18:07.0517384Z Running 0 items in this shard: 2025-12-04T09:18:07.0517575Z 2025-12-04T09:18:07.0517835Z Finished test_show_pickle 1/1 ... [2025-12-04 09:18:07.051369][1530.715187151], took 0.05min 2025-12-04T09:18:07.0544604Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_show_pickle/test_show_pickle-f61e02e14ce79045.xml 2025-12-04T09:18:07.0788936Z Running test_torch 1/1 ... [2025-12-04 09:18:07.078694][1530.742514596] 2025-12-04T09:18:07.0789369Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:18:07.0792326Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_torch.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:18:07.078983] 2025-12-04T09:18:25.1649579Z 2025-12-04T09:18:25.1651900Z PRINTING LOG FILE of test_torch 1/1 (test/test-reports/test_torch_1.1_b2685fecb603b132_.log) 2025-12-04T09:18:25.1652962Z Test results will be stored in test-reports/python-pytest/test_torch/test_torch-86cbf7756d7c226b.xml 2025-12-04T09:18:25.1653621Z ============================= test session starts ============================== 2025-12-04T09:18:25.1654249Z platform linux -- Python 3.10.14, pytest-7.3.2, pluggy-1.6.0 -- /opt/conda/envs/py_3.10/bin/python 2025-12-04T09:18:25.1654824Z cachedir: .pytest_cache 2025-12-04T09:18:25.1655504Z hypothesis profile 'pytorch_ci' -> database=None, max_examples=50, derandomize=True, suppress_health_check=[HealthCheck.too_slow] 2025-12-04T09:18:25.1656228Z rootdir: /var/lib/jenkins/workspace 2025-12-04T09:18:25.1656551Z configfile: pytest.ini 2025-12-04T09:18:25.1657114Z plugins: hypothesis-6.56.4, cpp-2.3.0, flakefinder-1.1.0, rerunfailures-14.0, subtests-0.13.1, xdist-3.3.1, xdoctest-1.3.0, typeguard-4.3.0 2025-12-04T09:18:25.1657713Z collecting ... collected 976 items 2025-12-04T09:18:25.1658021Z stepcurrent: Cannot find last run test, not skipping 2025-12-04T09:18:25.1682510Z Running 150 items in this shard: test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup 2025-12-04T09:18:25.1704688Z 2025-12-04T09:18:25.1704877Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.5913s] [ 0%] 2025-12-04T09:18:25.1705534Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0007s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 1%] 2025-12-04T09:18:25.1706450Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1707133Z test_torch.py::TestTorch::test_index_add_correctness FAILED [0.1480s] [ 2%] 2025-12-04T09:18:25.1707531Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.3069s] [ 2%] 2025-12-04T09:18:25.1707935Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2853s] [ 2%] 2025-12-04T09:18:25.1708342Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2760s] [ 2%] 2025-12-04T09:18:25.1708730Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2894s] [ 2%] 2025-12-04T09:18:25.1709124Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2978s] [ 2%] 2025-12-04T09:18:25.1709561Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2504s] [ 2%] 2025-12-04T09:18:25.1709944Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2742s] [ 2%] 2025-12-04T09:18:25.1710331Z test_torch.py::TestTorch::test_index_add_correctness FAILED [0.0976s] [ 2%] 2025-12-04T09:18:25.1710768Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2393s] [ 2%] 2025-12-04T09:18:25.1711162Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2461s] [ 2%] 2025-12-04T09:18:25.1711543Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2402s] [ 2%] 2025-12-04T09:18:25.1711934Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2768s] [ 2%] 2025-12-04T09:18:25.1712323Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2231s] [ 2%] 2025-12-04T09:18:25.1712708Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2578s] [ 2%] 2025-12-04T09:18:25.1713136Z test_torch.py::TestTorch::test_index_add_correctness FAILED [0.1554s] [ 2%] 2025-12-04T09:18:25.1713526Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2606s] [ 2%] 2025-12-04T09:18:25.1713912Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2719s] [ 2%] 2025-12-04T09:18:25.1714295Z test_torch.py::TestTorch::test_index_add_correctness FAILED [0.0910s] [ 2%] 2025-12-04T09:18:25.1714713Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2592s] [ 2%] 2025-12-04T09:18:25.1715101Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2893s] [ 2%] 2025-12-04T09:18:25.1715477Z test_torch.py::TestTorch::test_index_add_correctness FAILED [0.1907s] [ 2%] 2025-12-04T09:18:25.1715857Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2635s] [ 2%] 2025-12-04T09:18:25.1716247Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2480s] [ 2%] 2025-12-04T09:18:25.1716648Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2532s] [ 2%] 2025-12-04T09:18:25.1717039Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2454s] [ 2%] 2025-12-04T09:18:25.1717427Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2503s] [ 2%] 2025-12-04T09:18:25.1717811Z test_torch.py::TestTorch::test_index_add_correctness FAILED [0.0914s] [ 2%] 2025-12-04T09:18:25.1718193Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2345s] [ 2%] 2025-12-04T09:18:25.1718574Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2918s] [ 2%] 2025-12-04T09:18:25.1718972Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2711s] [ 2%] 2025-12-04T09:18:25.1719355Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2421s] [ 2%] 2025-12-04T09:18:25.1719731Z test_torch.py::TestTorch::test_index_add_correctness FAILED [0.1108s] [ 2%] 2025-12-04T09:18:25.1720115Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2393s] [ 2%] 2025-12-04T09:18:25.1720502Z test_torch.py::TestTorch::test_index_add_correctness FAILED [0.1733s] [ 2%] 2025-12-04T09:18:25.1720883Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2662s] [ 2%] 2025-12-04T09:18:25.1721259Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2720s] [ 2%] 2025-12-04T09:18:25.1721642Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2915s] [ 2%] 2025-12-04T09:18:25.1722027Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2664s] [ 2%] 2025-12-04T09:18:25.1722408Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2654s] [ 2%] 2025-12-04T09:18:25.1722793Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2477s] [ 2%] 2025-12-04T09:18:25.1723175Z test_torch.py::TestTorch::test_index_add_correctness FAILED [0.1108s] [ 2%] 2025-12-04T09:18:25.1723557Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2421s] [ 2%] 2025-12-04T09:18:25.1723939Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2740s] [ 2%] 2025-12-04T09:18:25.1724396Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2670s] [ 2%] 2025-12-04T09:18:25.1724784Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2570s] [ 2%] 2025-12-04T09:18:25.1725208Z test_torch.py::TestTorch::test_index_add_correctness FAILED [0.0933s] [ 2%] 2025-12-04T09:18:25.1725587Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2709s] [ 2%] 2025-12-04T09:18:25.1725973Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2525s] [ 2%] 2025-12-04T09:18:25.1726604Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1727450Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1728332Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1729164Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1730035Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1730866Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1731690Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1732634Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1733471Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1734309Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1735141Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1735961Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1736798Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1737631Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1738458Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1739286Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1740117Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1740947Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1741838Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1742672Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1743529Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1744354Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1745185Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1746069Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1746904Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1747764Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1748593Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1749419Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1750248Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1751071Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1751901Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1752734Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1753558Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1754383Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1755212Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1756058Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1756892Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1757721Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1758548Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1759411Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1760259Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1761124Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1762139Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1762966Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1763872Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1764699Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1765576Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1766418Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0007s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1767258Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1768156Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1769112Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1770072Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1771027Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1771970Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1773022Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1773982Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1774949Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1775903Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1776855Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1777868Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1778829Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1779844Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1780819Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1781768Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1782760Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1783715Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1784717Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1785684Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1786646Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1787599Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1788553Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1789506Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1790454Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1791419Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1792376Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1793331Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1794285Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1795236Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1796226Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1797182Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1798175Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1799355Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1800311Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1801318Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1802273Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1803265Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1804221Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1813749Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1814771Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1815769Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1816750Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1817713Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1818670Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1819638Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1820604Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1821563Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1822510Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1823541Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:18:25.1824064Z 2025-12-04T09:18:25.1824166Z =================================== FAILURES =================================== 2025-12-04T09:18:25.1824544Z _____________________ TestTorch.test_index_add_correctness _____________________ 2025-12-04T09:18:25.1824846Z Traceback (most recent call last): 2025-12-04T09:18:25.1825220Z File "/var/lib/jenkins/workspace/test/test_torch.py", line 6717, in test_index_add_correctness 2025-12-04T09:18:25.1825614Z helper(dim, dtype, device, size, size) 2025-12-04T09:18:25.1825943Z File "/var/lib/jenkins/workspace/test/test_torch.py", line 6708, in helper 2025-12-04T09:18:25.1826304Z self.assertEqual(out, ref_out, atol=1e-2, rtol=1e-2) 2025-12-04T09:18:25.1826865Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T09:18:25.1827385Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T09:18:25.1827687Z AssertionError: Tensor-likes are not close! 2025-12-04T09:18:25.1827852Z 2025-12-04T09:18:25.1827928Z Mismatched elements: 1 / 327680 (0.0%) 2025-12-04T09:18:25.1828250Z Greatest absolute difference: 0.0625 at index (4, 120, 82) (up to 0.01 allowed) 2025-12-04T09:18:25.1828719Z Greatest relative difference: 0.01470947265625 at index (4, 120, 82) (up to 0.01 allowed) 2025-12-04T09:18:25.1828989Z 2025-12-04T09:18:25.1829126Z To execute this test, run the following from the base repo dir: 2025-12-04T09:18:25.1829479Z python test/test_torch.py TestTorch.test_index_add_correctness 2025-12-04T09:18:25.1829693Z 2025-12-04T09:18:25.1829851Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:18:25.1830237Z _____________________ TestTorch.test_index_add_correctness _____________________ 2025-12-04T09:18:25.1830538Z Traceback (most recent call last): 2025-12-04T09:18:25.1830897Z File "/var/lib/jenkins/workspace/test/test_torch.py", line 6717, in test_index_add_correctness 2025-12-04T09:18:25.1831272Z helper(dim, dtype, device, size, size) 2025-12-04T09:18:25.1831582Z File "/var/lib/jenkins/workspace/test/test_torch.py", line 6708, in helper 2025-12-04T09:18:25.1831931Z self.assertEqual(out, ref_out, atol=1e-2, rtol=1e-2) 2025-12-04T09:18:25.1832420Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T09:18:25.1832927Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T09:18:25.1833209Z AssertionError: Tensor-likes are not close! 2025-12-04T09:18:25.1833371Z 2025-12-04T09:18:25.1833445Z Mismatched elements: 1 / 327680 (0.0%) 2025-12-04T09:18:25.1833756Z Greatest absolute difference: 0.0625 at index (4, 120, 82) (up to 0.01 allowed) 2025-12-04T09:18:25.1834190Z Greatest relative difference: 0.01470947265625 at index (4, 120, 82) (up to 0.01 allowed) 2025-12-04T09:18:25.1834456Z 2025-12-04T09:18:25.1834586Z To execute this test, run the following from the base repo dir: 2025-12-04T09:18:25.1834934Z python test/test_torch.py TestTorch.test_index_add_correctness 2025-12-04T09:18:25.1835150Z 2025-12-04T09:18:25.1835316Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:18:25.1835691Z _____________________ TestTorch.test_index_add_correctness _____________________ 2025-12-04T09:18:25.1835989Z Traceback (most recent call last): 2025-12-04T09:18:25.1836355Z File "/var/lib/jenkins/workspace/test/test_torch.py", line 6717, in test_index_add_correctness 2025-12-04T09:18:25.1836734Z helper(dim, dtype, device, size, size) 2025-12-04T09:18:25.1837048Z File "/var/lib/jenkins/workspace/test/test_torch.py", line 6708, in helper 2025-12-04T09:18:25.1837395Z self.assertEqual(out, ref_out, atol=1e-2, rtol=1e-2) 2025-12-04T09:18:25.1837935Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T09:18:25.1838452Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T09:18:25.1838744Z AssertionError: Tensor-likes are not close! 2025-12-04T09:18:25.1838945Z 2025-12-04T09:18:25.1839021Z Mismatched elements: 1 / 262144 (0.0%) 2025-12-04T09:18:25.1839340Z Greatest absolute difference: 0.03125 at index (1, 305, 250) (up to 0.01 allowed) 2025-12-04T09:18:25.1839773Z Greatest relative difference: 0.01495361328125 at index (1, 305, 250) (up to 0.01 allowed) 2025-12-04T09:18:25.1840053Z 2025-12-04T09:18:25.1840182Z To execute this test, run the following from the base repo dir: 2025-12-04T09:18:25.1840521Z python test/test_torch.py TestTorch.test_index_add_correctness 2025-12-04T09:18:25.1840727Z 2025-12-04T09:18:25.1840887Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:18:25.1841302Z _____________________ TestTorch.test_index_add_correctness _____________________ 2025-12-04T09:18:25.1841601Z Traceback (most recent call last): 2025-12-04T09:18:25.1841959Z File "/var/lib/jenkins/workspace/test/test_torch.py", line 6717, in test_index_add_correctness 2025-12-04T09:18:25.1842326Z helper(dim, dtype, device, size, size) 2025-12-04T09:18:25.1842640Z File "/var/lib/jenkins/workspace/test/test_torch.py", line 6708, in helper 2025-12-04T09:18:25.1842983Z self.assertEqual(out, ref_out, atol=1e-2, rtol=1e-2) 2025-12-04T09:18:25.1843507Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T09:18:25.1844009Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T09:18:25.1844296Z AssertionError: Tensor-likes are not close! 2025-12-04T09:18:25.1844454Z 2025-12-04T09:18:25.1844537Z Mismatched elements: 1 / 262144 (0.0%) 2025-12-04T09:18:25.1844848Z Greatest absolute difference: 0.03125 at index (1, 485, 130) (up to 0.01 allowed) 2025-12-04T09:18:25.1845284Z Greatest relative difference: 0.01495361328125 at index (1, 485, 130) (up to 0.01 allowed) 2025-12-04T09:18:25.1845557Z 2025-12-04T09:18:25.1845681Z To execute this test, run the following from the base repo dir: 2025-12-04T09:18:25.1846022Z python test/test_torch.py TestTorch.test_index_add_correctness 2025-12-04T09:18:25.1846232Z 2025-12-04T09:18:25.1846387Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:18:25.1846770Z _____________________ TestTorch.test_index_add_correctness _____________________ 2025-12-04T09:18:25.1847067Z Traceback (most recent call last): 2025-12-04T09:18:25.1847429Z File "/var/lib/jenkins/workspace/test/test_torch.py", line 6717, in test_index_add_correctness 2025-12-04T09:18:25.1847798Z helper(dim, dtype, device, size, size) 2025-12-04T09:18:25.1848110Z File "/var/lib/jenkins/workspace/test/test_torch.py", line 6708, in helper 2025-12-04T09:18:25.1848457Z self.assertEqual(out, ref_out, atol=1e-2, rtol=1e-2) 2025-12-04T09:18:25.1848945Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T09:18:25.1849456Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T09:18:25.1849752Z AssertionError: Tensor-likes are not close! 2025-12-04T09:18:25.1849921Z 2025-12-04T09:18:25.1850001Z Mismatched elements: 1 / 262144 (0.0%) 2025-12-04T09:18:25.1850313Z Greatest absolute difference: 0.03125 at index (1, 305, 250) (up to 0.01 allowed) 2025-12-04T09:18:25.1850750Z Greatest relative difference: 0.01495361328125 at index (1, 305, 250) (up to 0.01 allowed) 2025-12-04T09:18:25.1851019Z 2025-12-04T09:18:25.1851151Z To execute this test, run the following from the base repo dir: 2025-12-04T09:18:25.1851482Z python test/test_torch.py TestTorch.test_index_add_correctness 2025-12-04T09:18:25.1851695Z 2025-12-04T09:18:25.1851853Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:18:25.1852273Z _____________________ TestTorch.test_index_add_correctness _____________________ 2025-12-04T09:18:25.1852644Z Traceback (most recent call last): 2025-12-04T09:18:25.1852995Z File "/var/lib/jenkins/workspace/test/test_torch.py", line 6717, in test_index_add_correctness 2025-12-04T09:18:25.1853410Z helper(dim, dtype, device, size, size) 2025-12-04T09:18:25.1853737Z File "/var/lib/jenkins/workspace/test/test_torch.py", line 6708, in helper 2025-12-04T09:18:25.1854090Z self.assertEqual(out, ref_out, atol=1e-2, rtol=1e-2) 2025-12-04T09:18:25.1854571Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T09:18:25.1855086Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T09:18:25.1855374Z AssertionError: Tensor-likes are not close! 2025-12-04T09:18:25.1855534Z 2025-12-04T09:18:25.1855649Z Mismatched elements: 1 / 262144 (0.0%) 2025-12-04T09:18:25.1855965Z Greatest absolute difference: 0.046875 at index (1, 197, 130) (up to 0.01 allowed) 2025-12-04T09:18:25.1856417Z Greatest relative difference: 0.01324462890625 at index (1, 197, 130) (up to 0.01 allowed) 2025-12-04T09:18:25.1856686Z 2025-12-04T09:18:25.1856815Z To execute this test, run the following from the base repo dir: 2025-12-04T09:18:25.1857146Z python test/test_torch.py TestTorch.test_index_add_correctness 2025-12-04T09:18:25.1857358Z 2025-12-04T09:18:25.1857549Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:18:25.1857928Z _____________________ TestTorch.test_index_add_correctness _____________________ 2025-12-04T09:18:25.1858218Z Traceback (most recent call last): 2025-12-04T09:18:25.1858575Z File "/var/lib/jenkins/workspace/test/test_torch.py", line 6717, in test_index_add_correctness 2025-12-04T09:18:25.1858949Z helper(dim, dtype, device, size, size) 2025-12-04T09:18:25.1859265Z File "/var/lib/jenkins/workspace/test/test_torch.py", line 6708, in helper 2025-12-04T09:18:25.1859607Z self.assertEqual(out, ref_out, atol=1e-2, rtol=1e-2) 2025-12-04T09:18:25.1860100Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T09:18:25.1860604Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T09:18:25.1860893Z AssertionError: Tensor-likes are not close! 2025-12-04T09:18:25.1861050Z 2025-12-04T09:18:25.1861125Z Mismatched elements: 1 / 327680 (0.0%) 2025-12-04T09:18:25.1861434Z Greatest absolute difference: 0.0625 at index (4, 120, 82) (up to 0.01 allowed) 2025-12-04T09:18:25.1862047Z Greatest relative difference: 0.01495361328125 at index (4, 120, 82) (up to 0.01 allowed) 2025-12-04T09:18:25.1862323Z 2025-12-04T09:18:25.1862448Z To execute this test, run the following from the base repo dir: 2025-12-04T09:18:25.1862792Z python test/test_torch.py TestTorch.test_index_add_correctness 2025-12-04T09:18:25.1863009Z 2025-12-04T09:18:25.1863163Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:18:25.1863542Z _____________________ TestTorch.test_index_add_correctness _____________________ 2025-12-04T09:18:25.1863829Z Traceback (most recent call last): 2025-12-04T09:18:25.1864179Z File "/var/lib/jenkins/workspace/test/test_torch.py", line 6717, in test_index_add_correctness 2025-12-04T09:18:25.1864554Z helper(dim, dtype, device, size, size) 2025-12-04T09:18:25.1864860Z File "/var/lib/jenkins/workspace/test/test_torch.py", line 6708, in helper 2025-12-04T09:18:25.1865204Z self.assertEqual(out, ref_out, atol=1e-2, rtol=1e-2) 2025-12-04T09:18:25.1865689Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T09:18:25.1866189Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T09:18:25.1866476Z AssertionError: Tensor-likes are not close! 2025-12-04T09:18:25.1866644Z 2025-12-04T09:18:25.1866717Z Mismatched elements: 1 / 262144 (0.0%) 2025-12-04T09:18:25.1867108Z Greatest absolute difference: 0.03125 at index (1, 305, 250) (up to 0.01 allowed) 2025-12-04T09:18:25.1867544Z Greatest relative difference: 0.01495361328125 at index (1, 305, 250) (up to 0.01 allowed) 2025-12-04T09:18:25.1867869Z 2025-12-04T09:18:25.1867994Z To execute this test, run the following from the base repo dir: 2025-12-04T09:18:25.1868334Z python test/test_torch.py TestTorch.test_index_add_correctness 2025-12-04T09:18:25.1868543Z 2025-12-04T09:18:25.1868707Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:18:25.1869080Z _____________________ TestTorch.test_index_add_correctness _____________________ 2025-12-04T09:18:25.1869379Z Traceback (most recent call last): 2025-12-04T09:18:25.1869736Z File "/var/lib/jenkins/workspace/test/test_torch.py", line 6717, in test_index_add_correctness 2025-12-04T09:18:25.1870188Z helper(dim, dtype, device, size, size) 2025-12-04T09:18:25.1870498Z File "/var/lib/jenkins/workspace/test/test_torch.py", line 6708, in helper 2025-12-04T09:18:25.1870848Z self.assertEqual(out, ref_out, atol=1e-2, rtol=1e-2) 2025-12-04T09:18:25.1871353Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T09:18:25.1871863Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T09:18:25.1872158Z AssertionError: Tensor-likes are not close! 2025-12-04T09:18:25.1872376Z 2025-12-04T09:18:25.1872454Z Mismatched elements: 1 / 327680 (0.0%) 2025-12-04T09:18:25.1872762Z Greatest absolute difference: 0.0625 at index (4, 120, 82) (up to 0.01 allowed) 2025-12-04T09:18:25.1873192Z Greatest relative difference: 0.01495361328125 at index (4, 120, 82) (up to 0.01 allowed) 2025-12-04T09:18:25.1873466Z 2025-12-04T09:18:25.1873592Z To execute this test, run the following from the base repo dir: 2025-12-04T09:18:25.1873938Z python test/test_torch.py TestTorch.test_index_add_correctness 2025-12-04T09:18:25.1874147Z 2025-12-04T09:18:25.1874312Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:18:25.1874685Z _____________________ TestTorch.test_index_add_correctness _____________________ 2025-12-04T09:18:25.1874986Z Traceback (most recent call last): 2025-12-04T09:18:25.1875341Z File "/var/lib/jenkins/workspace/test/test_torch.py", line 6717, in test_index_add_correctness 2025-12-04T09:18:25.1875710Z helper(dim, dtype, device, size, size) 2025-12-04T09:18:25.1876034Z File "/var/lib/jenkins/workspace/test/test_torch.py", line 6708, in helper 2025-12-04T09:18:25.1876386Z self.assertEqual(out, ref_out, atol=1e-2, rtol=1e-2) 2025-12-04T09:18:25.1876878Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T09:18:25.1877388Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T09:18:25.1877685Z AssertionError: Tensor-likes are not close! 2025-12-04T09:18:25.1877845Z 2025-12-04T09:18:25.1877926Z Mismatched elements: 1 / 327680 (0.0%) 2025-12-04T09:18:25.1878233Z Greatest absolute difference: 0.0625 at index (4, 120, 82) (up to 0.01 allowed) 2025-12-04T09:18:25.1878670Z Greatest relative difference: 0.01470947265625 at index (4, 120, 82) (up to 0.01 allowed) 2025-12-04T09:18:25.1878943Z 2025-12-04T09:18:25.1879065Z To execute this test, run the following from the base repo dir: 2025-12-04T09:18:25.1879406Z python test/test_torch.py TestTorch.test_index_add_correctness 2025-12-04T09:18:25.1879614Z 2025-12-04T09:18:25.1879770Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:18:25.1880341Z - generated xml file: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_torch/test_torch-86cbf7756d7c226b.xml - 2025-12-04T09:18:25.1880872Z =========================== short test summary info ============================ 2025-12-04T09:18:25.1881326Z FAILED [0.1480s] test_torch.py::TestTorch::test_index_add_correctness - AssertionError: Tensor-likes are not close! 2025-12-04T09:18:25.1881659Z 2025-12-04T09:18:25.1881779Z Mismatched elements: 1 / 327680 (0.0%) 2025-12-04T09:18:25.1882094Z Greatest absolute difference: 0.0625 at index (4, 120, 82) (up to 0.01 allowed) 2025-12-04T09:18:25.1882568Z Greatest relative difference: 0.01470947265625 at index (4, 120, 82) (up to 0.01 allowed) 2025-12-04T09:18:25.1882844Z 2025-12-04T09:18:25.1882979Z To execute this test, run the following from the base repo dir: 2025-12-04T09:18:25.1883325Z python test/test_torch.py TestTorch.test_index_add_correctness 2025-12-04T09:18:25.1883535Z 2025-12-04T09:18:25.1883699Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:18:25.1884190Z FAILED [0.0976s] test_torch.py::TestTorch::test_index_add_correctness - AssertionError: Tensor-likes are not close! 2025-12-04T09:18:25.1884532Z 2025-12-04T09:18:25.1884646Z Mismatched elements: 1 / 327680 (0.0%) 2025-12-04T09:18:25.1884969Z Greatest absolute difference: 0.0625 at index (4, 120, 82) (up to 0.01 allowed) 2025-12-04T09:18:25.1885413Z Greatest relative difference: 0.01470947265625 at index (4, 120, 82) (up to 0.01 allowed) 2025-12-04T09:18:25.1885679Z 2025-12-04T09:18:25.1885805Z To execute this test, run the following from the base repo dir: 2025-12-04T09:18:25.1886148Z python test/test_torch.py TestTorch.test_index_add_correctness 2025-12-04T09:18:25.1886358Z 2025-12-04T09:18:25.1886562Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:18:25.1887056Z FAILED [0.1554s] test_torch.py::TestTorch::test_index_add_correctness - AssertionError: Tensor-likes are not close! 2025-12-04T09:18:25.1887388Z 2025-12-04T09:18:25.1887465Z Mismatched elements: 1 / 262144 (0.0%) 2025-12-04T09:18:25.1887783Z Greatest absolute difference: 0.03125 at index (1, 305, 250) (up to 0.01 allowed) 2025-12-04T09:18:25.1888227Z Greatest relative difference: 0.01495361328125 at index (1, 305, 250) (up to 0.01 allowed) 2025-12-04T09:18:25.1888495Z 2025-12-04T09:18:25.1888626Z To execute this test, run the following from the base repo dir: 2025-12-04T09:18:25.1888955Z python test/test_torch.py TestTorch.test_index_add_correctness 2025-12-04T09:18:25.1889167Z 2025-12-04T09:18:25.1889319Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:18:25.1889804Z FAILED [0.0910s] test_torch.py::TestTorch::test_index_add_correctness - AssertionError: Tensor-likes are not close! 2025-12-04T09:18:25.1890134Z 2025-12-04T09:18:25.1890207Z Mismatched elements: 1 / 262144 (0.0%) 2025-12-04T09:18:25.1890514Z Greatest absolute difference: 0.03125 at index (1, 485, 130) (up to 0.01 allowed) 2025-12-04T09:18:25.1890945Z Greatest relative difference: 0.01495361328125 at index (1, 485, 130) (up to 0.01 allowed) 2025-12-04T09:18:25.1891208Z 2025-12-04T09:18:25.1891335Z To execute this test, run the following from the base repo dir: 2025-12-04T09:18:25.1891659Z python test/test_torch.py TestTorch.test_index_add_correctness 2025-12-04T09:18:25.1891867Z 2025-12-04T09:18:25.1892022Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:18:25.1892553Z FAILED [0.1907s] test_torch.py::TestTorch::test_index_add_correctness - AssertionError: Tensor-likes are not close! 2025-12-04T09:18:25.1892885Z 2025-12-04T09:18:25.1892975Z Mismatched elements: 1 / 262144 (0.0%) 2025-12-04T09:18:25.1893280Z Greatest absolute difference: 0.03125 at index (1, 305, 250) (up to 0.01 allowed) 2025-12-04T09:18:25.1893710Z Greatest relative difference: 0.01495361328125 at index (1, 305, 250) (up to 0.01 allowed) 2025-12-04T09:18:25.1893976Z 2025-12-04T09:18:25.1894105Z To execute this test, run the following from the base repo dir: 2025-12-04T09:18:25.1894435Z python test/test_torch.py TestTorch.test_index_add_correctness 2025-12-04T09:18:25.1894640Z 2025-12-04T09:18:25.1894793Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:18:25.1895279Z FAILED [0.0914s] test_torch.py::TestTorch::test_index_add_correctness - AssertionError: Tensor-likes are not close! 2025-12-04T09:18:25.1895653Z 2025-12-04T09:18:25.1895733Z Mismatched elements: 1 / 262144 (0.0%) 2025-12-04T09:18:25.1896048Z Greatest absolute difference: 0.046875 at index (1, 197, 130) (up to 0.01 allowed) 2025-12-04T09:18:25.1896519Z Greatest relative difference: 0.01324462890625 at index (1, 197, 130) (up to 0.01 allowed) 2025-12-04T09:18:25.1896789Z 2025-12-04T09:18:25.1896914Z To execute this test, run the following from the base repo dir: 2025-12-04T09:18:25.1897243Z python test/test_torch.py TestTorch.test_index_add_correctness 2025-12-04T09:18:25.1897449Z 2025-12-04T09:18:25.1897603Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:18:25.1898089Z FAILED [0.1108s] test_torch.py::TestTorch::test_index_add_correctness - AssertionError: Tensor-likes are not close! 2025-12-04T09:18:25.1898464Z 2025-12-04T09:18:25.1898545Z Mismatched elements: 1 / 327680 (0.0%) 2025-12-04T09:18:25.1898852Z Greatest absolute difference: 0.0625 at index (4, 120, 82) (up to 0.01 allowed) 2025-12-04T09:18:25.1899278Z Greatest relative difference: 0.01495361328125 at index (4, 120, 82) (up to 0.01 allowed) 2025-12-04T09:18:25.1899549Z 2025-12-04T09:18:25.1899670Z To execute this test, run the following from the base repo dir: 2025-12-04T09:18:25.1900006Z python test/test_torch.py TestTorch.test_index_add_correctness 2025-12-04T09:18:25.1900210Z 2025-12-04T09:18:25.1900406Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:18:25.1900886Z FAILED [0.1733s] test_torch.py::TestTorch::test_index_add_correctness - AssertionError: Tensor-likes are not close! 2025-12-04T09:18:25.1901221Z 2025-12-04T09:18:25.1901293Z Mismatched elements: 1 / 262144 (0.0%) 2025-12-04T09:18:25.1901602Z Greatest absolute difference: 0.03125 at index (1, 305, 250) (up to 0.01 allowed) 2025-12-04T09:18:25.1902034Z Greatest relative difference: 0.01495361328125 at index (1, 305, 250) (up to 0.01 allowed) 2025-12-04T09:18:25.1902298Z 2025-12-04T09:18:25.1902424Z To execute this test, run the following from the base repo dir: 2025-12-04T09:18:25.1902754Z python test/test_torch.py TestTorch.test_index_add_correctness 2025-12-04T09:18:25.1902960Z 2025-12-04T09:18:25.1903125Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:18:25.1903609Z FAILED [0.1108s] test_torch.py::TestTorch::test_index_add_correctness - AssertionError: Tensor-likes are not close! 2025-12-04T09:18:25.1903937Z 2025-12-04T09:18:25.1904011Z Mismatched elements: 1 / 327680 (0.0%) 2025-12-04T09:18:25.1904318Z Greatest absolute difference: 0.0625 at index (4, 120, 82) (up to 0.01 allowed) 2025-12-04T09:18:25.1904744Z Greatest relative difference: 0.01495361328125 at index (4, 120, 82) (up to 0.01 allowed) 2025-12-04T09:18:25.1905008Z 2025-12-04T09:18:25.1905131Z To execute this test, run the following from the base repo dir: 2025-12-04T09:18:25.1905467Z python test/test_torch.py TestTorch.test_index_add_correctness 2025-12-04T09:18:25.1905676Z 2025-12-04T09:18:25.1905830Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:18:25.1906319Z FAILED [0.0933s] test_torch.py::TestTorch::test_index_add_correctness - AssertionError: Tensor-likes are not close! 2025-12-04T09:18:25.1906647Z 2025-12-04T09:18:25.1906731Z Mismatched elements: 1 / 327680 (0.0%) 2025-12-04T09:18:25.1907042Z Greatest absolute difference: 0.0625 at index (4, 120, 82) (up to 0.01 allowed) 2025-12-04T09:18:25.1907469Z Greatest relative difference: 0.01470947265625 at index (4, 120, 82) (up to 0.01 allowed) 2025-12-04T09:18:25.1907731Z 2025-12-04T09:18:25.1907858Z To execute this test, run the following from the base repo dir: 2025-12-04T09:18:25.1908183Z python test/test_torch.py TestTorch.test_index_add_correctness 2025-12-04T09:18:25.1908392Z 2025-12-04T09:18:25.1908546Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:18:25.1908899Z ================= 10 failed, 40 passed, 100 skipped in 13.59s ================== 2025-12-04T09:18:25.1909139Z 2025-12-04T09:18:25.1909380Z FINISHED PRINTING LOG FILE of test_torch 1/1 (test/test-reports/test_torch_1.1_b2685fecb603b132_.log) 2025-12-04T09:18:25.1909681Z 2025-12-04T09:18:25.1909872Z Finished test_torch 1/1 ... [2025-12-04 09:18:25.164937][1548.828751569], took 0.30min 2025-12-04T09:18:25.1910477Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_torch/test_torch-86cbf7756d7c226b.xml 2025-12-04T09:18:26.5047461Z Uploading logs for 57116084853 to S3 2025-12-04T09:18:26.6999528Z Uploading artifacts took 1.45 seconds 2025-12-04T09:18:26.6999928Z test_torch 1/1 failed! 2025-12-04T09:18:26.7003859Z Running test_tensorexpr 1/1 ... [2025-12-04 09:18:26.700166][1550.363980567] 2025-12-04T09:18:26.7004402Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:18:26.7009052Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_tensorexpr.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:18:26.700628] 2025-12-04T09:18:29.5780733Z 2025-12-04T09:18:29.5781723Z test_tensorexpr 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_tensorexpr_1.1_8f889bbf725904b3_.log 2025-12-04T09:18:29.5782607Z Running 0 items in this shard: 2025-12-04T09:18:29.5783082Z 2025-12-04T09:18:29.5783397Z Finished test_tensorexpr 1/1 ... [2025-12-04 09:18:29.577867][1553.241683795], took 0.05min 2025-12-04T09:18:29.5813917Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_tensorexpr/test_tensorexpr-d7faab3c9028401e.xml 2025-12-04T09:18:29.6245461Z Running test_utils 1/1 ... [2025-12-04 09:18:29.624318][1553.288137465] 2025-12-04T09:18:29.6245985Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:18:29.6249504Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_utils.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:18:29.624667] 2025-12-04T09:18:42.5454227Z 2025-12-04T09:18:42.5455390Z test_utils 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_utils_1.1_54bd9f1f48dc35ff_.log 2025-12-04T09:18:42.5476812Z Running 100 items in this shard: test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed 2025-12-04T09:18:42.5493223Z 2025-12-04T09:18:42.5493399Z Finished test_utils 1/1 ... [2025-12-04 09:18:42.545215][1566.20903167], took 0.22min 2025-12-04T09:18:42.5494705Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_utils/test_utils-d587f66af7496981.xml 2025-12-04T09:18:42.6352065Z Running test_namedtuple_return_api 1/1 ... [2025-12-04 09:18:42.634963][1566.298781304] 2025-12-04T09:18:42.6352647Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:18:42.6355814Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_namedtuple_return_api.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:18:42.635290] 2025-12-04T09:18:45.4278632Z 2025-12-04T09:18:45.4279661Z test_namedtuple_return_api 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_namedtuple_return_api_1.1_e16a2c1bff68b907_.log 2025-12-04T09:18:45.4280665Z Running 0 items in this shard: 2025-12-04T09:18:45.4280859Z 2025-12-04T09:18:45.4281134Z Finished test_namedtuple_return_api 1/1 ... [2025-12-04 09:18:45.427659][1569.091476485], took 0.05min 2025-12-04T09:18:45.4313013Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_namedtuple_return_api/test_namedtuple_return_api-bfbca5bc9e064895.xml 2025-12-04T09:18:45.4589807Z Running test_fake_tensor 1/1 ... [2025-12-04 09:18:45.458741][1569.122560939] 2025-12-04T09:18:45.4590299Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:18:45.4593188Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_fake_tensor.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:18:45.459061] 2025-12-04T09:18:52.5507682Z 2025-12-04T09:18:52.5508332Z PRINTING LOG FILE of test_fake_tensor 1/1 (test/test-reports/test_fake_tensor_1.1_6cc5e0685466fadf_.log) 2025-12-04T09:18:52.5509028Z Test results will be stored in test-reports/python-pytest/test_fake_tensor/test_fake_tensor-1a668b992cb70372.xml 2025-12-04T09:18:52.5509547Z ============================= test session starts ============================== 2025-12-04T09:18:52.5509985Z platform linux -- Python 3.10.14, pytest-7.3.2, pluggy-1.6.0 -- /opt/conda/envs/py_3.10/bin/python 2025-12-04T09:18:52.5510424Z cachedir: .pytest_cache 2025-12-04T09:18:52.5510903Z hypothesis profile 'pytorch_ci' -> database=None, max_examples=50, derandomize=True, suppress_health_check=[HealthCheck.too_slow] 2025-12-04T09:18:52.5511414Z rootdir: /var/lib/jenkins/workspace 2025-12-04T09:18:52.5511646Z configfile: pytest.ini 2025-12-04T09:18:52.5512111Z plugins: hypothesis-6.56.4, cpp-2.3.0, flakefinder-1.1.0, rerunfailures-14.0, subtests-0.13.1, xdist-3.3.1, xdoctest-1.3.0, typeguard-4.3.0 2025-12-04T09:18:52.5512625Z collecting ... collected 288 items 2025-12-04T09:18:52.5512891Z stepcurrent: Cannot find last run test, not skipping 2025-12-04T09:18:52.5524656Z Running 50 items in this shard: test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to 2025-12-04T09:18:52.5535212Z 2025-12-04T09:18:52.5535431Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to FAILED [0.0306s] [ 2%] 2025-12-04T09:18:52.5535930Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to FAILED [0.0081s] [ 2%] 2025-12-04T09:18:52.5536408Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0081s] [ 2%] 2025-12-04T09:18:52.5536890Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0078s] [ 2%] 2025-12-04T09:18:52.5537366Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0079s] [ 2%] 2025-12-04T09:18:52.5537840Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0079s] [ 2%] 2025-12-04T09:18:52.5538311Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0079s] [ 2%] 2025-12-04T09:18:52.5538787Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0079s] [ 2%] 2025-12-04T09:18:52.5539321Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0079s] [ 2%] 2025-12-04T09:18:52.5539799Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0079s] [ 2%] 2025-12-04T09:18:52.5540334Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0078s] [ 2%] 2025-12-04T09:18:52.5540808Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0078s] [ 2%] 2025-12-04T09:18:52.5541283Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0078s] [ 2%] 2025-12-04T09:18:52.5541765Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0079s] [ 2%] 2025-12-04T09:18:52.5542245Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0079s] [ 2%] 2025-12-04T09:18:52.5542720Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0078s] [ 2%] 2025-12-04T09:18:52.5543244Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0078s] [ 2%] 2025-12-04T09:18:52.5543722Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0078s] [ 2%] 2025-12-04T09:18:52.5544199Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0078s] [ 2%] 2025-12-04T09:18:52.5544678Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0078s] [ 2%] 2025-12-04T09:18:52.5545193Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0078s] [ 2%] 2025-12-04T09:18:52.5545683Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0078s] [ 2%] 2025-12-04T09:18:52.5546155Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0078s] [ 2%] 2025-12-04T09:18:52.5546628Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0078s] [ 2%] 2025-12-04T09:18:52.5547099Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0078s] [ 2%] 2025-12-04T09:18:52.5547578Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0078s] [ 2%] 2025-12-04T09:18:52.5548051Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0079s] [ 2%] 2025-12-04T09:18:52.5548525Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0078s] [ 2%] 2025-12-04T09:18:52.5548995Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0081s] [ 2%] 2025-12-04T09:18:52.5549467Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0077s] [ 2%] 2025-12-04T09:18:52.5549942Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0078s] [ 2%] 2025-12-04T09:18:52.5550560Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0078s] [ 2%] 2025-12-04T09:18:52.5551037Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0078s] [ 2%] 2025-12-04T09:18:52.5551528Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0078s] [ 2%] 2025-12-04T09:18:52.5552005Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0078s] [ 2%] 2025-12-04T09:18:52.5552470Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0078s] [ 2%] 2025-12-04T09:18:52.5552942Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0081s] [ 2%] 2025-12-04T09:18:52.5553416Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0078s] [ 2%] 2025-12-04T09:18:52.5553887Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0078s] [ 2%] 2025-12-04T09:18:52.5554354Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0078s] [ 2%] 2025-12-04T09:18:52.5554828Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0078s] [ 2%] 2025-12-04T09:18:52.5555313Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0078s] [ 2%] 2025-12-04T09:18:52.5555863Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0079s] [ 2%] 2025-12-04T09:18:52.5556348Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0078s] [ 2%] 2025-12-04T09:18:52.5556817Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0078s] [ 2%] 2025-12-04T09:18:52.5557330Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0078s] [ 2%] 2025-12-04T09:18:52.5557806Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0078s] [ 2%] 2025-12-04T09:18:52.5558276Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0078s] [ 2%] 2025-12-04T09:18:52.5558751Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0078s] [ 2%] 2025-12-04T09:18:52.5559234Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0078s] [ 2%] 2025-12-04T09:18:52.5559550Z 2025-12-04T09:18:52.5559651Z =================================== FAILURES =================================== 2025-12-04T09:18:52.5559985Z _________________ FakeTensorOperatorInvariants.test_module_to __________________ 2025-12-04T09:18:52.5560304Z Traceback (most recent call last): 2025-12-04T09:18:52.5560714Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 986, in _apply 2025-12-04T09:18:52.5561153Z torch.utils.swap_tensors(param, param_applied) 2025-12-04T09:18:52.5561623Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/__init__.py", line 45, in swap_tensors 2025-12-04T09:18:52.5562430Z raise RuntimeError("Cannot swap t1 because it has weakref associated with it") 2025-12-04T09:18:52.5562860Z RuntimeError: Cannot swap t1 because it has weakref associated with it 2025-12-04T09:18:52.5563093Z 2025-12-04T09:18:52.5563238Z The above exception was the direct cause of the following exception: 2025-12-04T09:18:52.5563466Z 2025-12-04T09:18:52.5563554Z Traceback (most recent call last): 2025-12-04T09:18:52.5563907Z File "/var/lib/jenkins/workspace/test/test_fake_tensor.py", line 1665, in test_module_to 2025-12-04T09:18:52.5564243Z m.to("cuda") 2025-12-04T09:18:52.5564585Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1383, in to 2025-12-04T09:18:52.5564983Z return self._apply(convert) 2025-12-04T09:18:52.5565370Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 990, in _apply 2025-12-04T09:18:52.5565756Z raise RuntimeError( 2025-12-04T09:18:52.5565976Z RuntimeError: _apply(): Couldn't swap Linear.weight 2025-12-04T09:18:52.5566162Z 2025-12-04T09:18:52.5566289Z To execute this test, run the following from the base repo dir: 2025-12-04T09:18:52.5566674Z python test/test_fake_tensor.py FakeTensorOperatorInvariants.test_module_to 2025-12-04T09:18:52.5566926Z 2025-12-04T09:18:52.5567089Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:18:52.5567491Z _________________ FakeTensorOperatorInvariants.test_module_to __________________ 2025-12-04T09:18:52.5567806Z Traceback (most recent call last): 2025-12-04T09:18:52.5568198Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 986, in _apply 2025-12-04T09:18:52.5568625Z torch.utils.swap_tensors(param, param_applied) 2025-12-04T09:18:52.5569055Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/__init__.py", line 47, in swap_tensors 2025-12-04T09:18:52.5569555Z raise RuntimeError("Cannot swap t2 because it has weakref associated with it") 2025-12-04T09:18:52.5569953Z RuntimeError: Cannot swap t2 because it has weakref associated with it 2025-12-04T09:18:52.5570187Z 2025-12-04T09:18:52.5570332Z The above exception was the direct cause of the following exception: 2025-12-04T09:18:52.5570560Z 2025-12-04T09:18:52.5570634Z Traceback (most recent call last): 2025-12-04T09:18:52.5570979Z File "/var/lib/jenkins/workspace/test/test_fake_tensor.py", line 1665, in test_module_to 2025-12-04T09:18:52.5571400Z m.to("cuda") 2025-12-04T09:18:52.5571762Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1383, in to 2025-12-04T09:18:52.5572159Z return self._apply(convert) 2025-12-04T09:18:52.5572677Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 990, in _apply 2025-12-04T09:18:52.5573083Z raise RuntimeError( 2025-12-04T09:18:52.5573309Z RuntimeError: _apply(): Couldn't swap Linear.bias 2025-12-04T09:18:52.5573484Z 2025-12-04T09:18:52.5573628Z To execute this test, run the following from the base repo dir: 2025-12-04T09:18:52.5574012Z python test/test_fake_tensor.py FakeTensorOperatorInvariants.test_module_to 2025-12-04T09:18:52.5574274Z 2025-12-04T09:18:52.5574436Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:18:52.5575136Z - generated xml file: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_fake_tensor/test_fake_tensor-1a668b992cb70372.xml - 2025-12-04T09:18:52.5575693Z =========================== short test summary info ============================ 2025-12-04T09:18:52.5576226Z FAILED [0.0306s] test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to - RuntimeError: _apply(): Couldn't swap Linear.weight 2025-12-04T09:18:52.5576643Z 2025-12-04T09:18:52.5576770Z To execute this test, run the following from the base repo dir: 2025-12-04T09:18:52.5577207Z python test/test_fake_tensor.py FakeTensorOperatorInvariants.test_module_to 2025-12-04T09:18:52.5577460Z 2025-12-04T09:18:52.5577627Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:18:52.5578180Z FAILED [0.0081s] test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to - RuntimeError: _apply(): Couldn't swap Linear.bias 2025-12-04T09:18:52.5578583Z 2025-12-04T09:18:52.5578708Z To execute this test, run the following from the base repo dir: 2025-12-04T09:18:52.5579079Z python test/test_fake_tensor.py FakeTensorOperatorInvariants.test_module_to 2025-12-04T09:18:52.5579329Z 2025-12-04T09:18:52.5579488Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:18:52.5579817Z ========================= 2 failed, 48 passed in 0.90s ========================= 2025-12-04T09:18:52.5580006Z 2025-12-04T09:18:52.5580257Z FINISHED PRINTING LOG FILE of test_fake_tensor 1/1 (test/test-reports/test_fake_tensor_1.1_6cc5e0685466fadf_.log) 2025-12-04T09:18:52.5580586Z 2025-12-04T09:18:52.5580764Z Finished test_fake_tensor 1/1 ... [2025-12-04 09:18:52.550577][1576.214394087], took 0.12min 2025-12-04T09:18:52.5581414Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_fake_tensor/test_fake_tensor-1a668b992cb70372.xml 2025-12-04T09:18:54.9216133Z Uploading logs for 57116084853 to S3 2025-12-04T09:18:55.0929721Z Uploading artifacts took 2.40 seconds 2025-12-04T09:18:55.0930162Z test_fake_tensor 1/1 failed! 2025-12-04T09:18:55.0934038Z Running test_multiprocessing 1/1 ... [2025-12-04 09:18:55.093174][1578.756989305] 2025-12-04T09:18:55.0934617Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:18:55.0939108Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_multiprocessing.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:18:55.093607] 2025-12-04T09:23:54.6522024Z 2025-12-04T09:23:54.6522957Z test_multiprocessing 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_multiprocessing_1.1_79160357459cb2c2_.log 2025-12-04T09:23:54.6548226Z Running 100 items in this shard: test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple 2025-12-04T09:23:54.6569756Z 2025-12-04T09:23:54.6569980Z Finished test_multiprocessing 1/1 ... [2025-12-04 09:23:54.652060][1878.315874742], took 4.99min 2025-12-04T09:23:54.6570748Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_multiprocessing/test_multiprocessing-d56d6df4e289f373.xml 2025-12-04T09:23:54.7447281Z Running test_fx 1/1 ... [2025-12-04 09:23:54.744482][1878.408300364] 2025-12-04T09:23:54.7447743Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:23:54.7450880Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_fx.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:23:54.744793] 2025-12-04T09:26:30.8344567Z 2025-12-04T09:26:30.8347438Z test_fx 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_fx_1.1_c23d985f9096efa2_.log 2025-12-04T09:26:30.8365358Z Running 100 items in this shard: test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7 2025-12-04T09:26:30.8381894Z 2025-12-04T09:26:30.8382069Z Finished test_fx 1/1 ... [2025-12-04 09:26:30.834284][2034.49809947], took 2.60min 2025-12-04T09:26:30.8385750Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_fx/test_fx-75f810e34c21fdae.xml 2025-12-04T09:26:30.9293293Z Running test_autograd_fallback 1/1 ... [2025-12-04 09:26:30.929060][2034.592878185] 2025-12-04T09:26:30.9293945Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:26:30.9297587Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_autograd_fallback.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:26:30.929404] 2025-12-04T09:26:33.6924274Z 2025-12-04T09:26:33.6925249Z test_autograd_fallback 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_autograd_fallback_1.1_d1fd30600f5e8389_.log 2025-12-04T09:26:33.6926120Z Running 0 items in this shard: 2025-12-04T09:26:33.6926326Z 2025-12-04T09:26:33.6926642Z Finished test_autograd_fallback 1/1 ... [2025-12-04 09:26:33.692184][2037.356001054], took 0.05min 2025-12-04T09:26:33.6964313Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_autograd_fallback/test_autograd_fallback-af9a7a31daae2066.xml 2025-12-04T09:26:33.7239580Z Running test_autocast 1/1 ... [2025-12-04 09:26:33.723730][2037.387550168] 2025-12-04T09:26:33.7240060Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:26:33.7243675Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_autocast.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:26:33.724059] 2025-12-04T09:26:36.8548186Z 2025-12-04T09:26:36.8549640Z test_autocast 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_autocast_1.1_8f25c3d25ce2a5fe_.log 2025-12-04T09:26:36.8551489Z Running 0 items in this shard: 2025-12-04T09:26:36.8551892Z 2025-12-04T09:26:36.8552358Z Finished test_autocast 1/1 ... [2025-12-04 09:26:36.854588][2040.518402996], took 0.05min 2025-12-04T09:26:36.8592571Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_autocast/test_autocast-6368d686b9d0a76a.xml 2025-12-04T09:26:36.8900475Z Running test_python_dispatch 1/1 ... [2025-12-04 09:26:36.889825][2040.55364411] 2025-12-04T09:26:36.8901058Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:26:36.8904848Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_python_dispatch.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:26:36.890183] 2025-12-04T09:26:41.2874811Z 2025-12-04T09:26:41.2875594Z test_python_dispatch 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_python_dispatch_1.1_6d2023546169b768_.log 2025-12-04T09:26:41.2876537Z Running 0 items in this shard: 2025-12-04T09:26:41.2876741Z 2025-12-04T09:26:41.2877001Z Finished test_python_dispatch 1/1 ... [2025-12-04 09:26:41.287234][2044.951050208], took 0.07min 2025-12-04T09:26:41.2917234Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_python_dispatch/test_python_dispatch-a00658362550dfa9.xml 2025-12-04T09:26:41.3239454Z Running test_jit_disabled 1/1 ... [2025-12-04 09:26:41.323711][2044.987530434] 2025-12-04T09:26:41.3239955Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:26:41.3243312Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_jit_disabled.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:26:41.324050] 2025-12-04T09:26:44.1202034Z 2025-12-04T09:26:44.1202757Z test_jit_disabled 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_jit_disabled_1.1_f4341100659034b7_.log 2025-12-04T09:26:44.1203306Z Running 0 items in this shard: 2025-12-04T09:26:44.1203445Z 2025-12-04T09:26:44.1203651Z Finished test_jit_disabled 1/1 ... [2025-12-04 09:26:44.120009][2047.783825926], took 0.05min 2025-12-04T09:26:44.1246201Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_jit_disabled/test_jit_disabled-577956ec3f5f835e.xml 2025-12-04T09:26:44.1560494Z Running test_cpp_extensions_mtia_backend 1/1 ... [2025-12-04 09:26:44.155834][2047.81965402] 2025-12-04T09:26:44.1561091Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:26:44.1564623Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_cpp_extensions_mtia_backend.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:26:44.156159] 2025-12-04T09:26:46.9542136Z 2025-12-04T09:26:46.9542950Z test_cpp_extensions_mtia_backend 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_cpp_extensions_mtia_backend_1.1_592f44b55155dce1_.log 2025-12-04T09:26:46.9543581Z Running 0 items in this shard: 2025-12-04T09:26:46.9543720Z 2025-12-04T09:26:46.9543960Z Finished test_cpp_extensions_mtia_backend 1/1 ... [2025-12-04 09:26:46.953998][2050.617814534], took 0.05min 2025-12-04T09:26:46.9589222Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_cpp_extensions_mtia_backend/test_cpp_extensions_mtia_backend-5e73672e9ac01a41.xml 2025-12-04T09:26:47.0444048Z Running functorch/test_memory_efficient_fusion 1/1 ... [2025-12-04 09:26:47.044145][2050.70796416] 2025-12-04T09:26:47.0444859Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:26:47.0447573Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'functorch/test_memory_efficient_fusion.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:26:47.044446] 2025-12-04T09:26:50.0991396Z 2025-12-04T09:26:50.0992435Z functorch/test_memory_efficient_fusion 1/1 was successful, full logs can be found in artifacts with path test/test-reports/functorch.test_memory_efficient_fusion_1.1_cd53c92f6cd6db14_.log 2025-12-04T09:26:50.0993642Z Running 0 items in this shard: 2025-12-04T09:26:50.0993821Z 2025-12-04T09:26:50.0994192Z Finished functorch/test_memory_efficient_fusion 1/1 ... [2025-12-04 09:26:50.098903][2053.762718873], took 0.05min 2025-12-04T09:26:50.1041504Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/functorch.test_memory_efficient_fusion/functorch.test_memory_efficient_fusion-33238d8e909cf2e4.xml 2025-12-04T09:26:50.1320089Z Running test_tensor_creation_ops 1/1 ... [2025-12-04 09:26:50.131783][2053.795602093] 2025-12-04T09:26:50.1320877Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:26:50.1324015Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_tensor_creation_ops.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:26:50.132116] 2025-12-04T09:26:54.1051117Z 2025-12-04T09:26:54.1052668Z test_tensor_creation_ops 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_tensor_creation_ops_1.1_cdc54e89ec4eb471_.log 2025-12-04T09:26:54.1054127Z Running 0 items in this shard: 2025-12-04T09:26:54.1054458Z 2025-12-04T09:26:54.1054975Z Finished test_tensor_creation_ops 1/1 ... [2025-12-04 09:26:54.104843][2057.768659202], took 0.07min 2025-12-04T09:26:54.1103596Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_tensor_creation_ops/test_tensor_creation_ops-7daf652f1a7b5488.xml 2025-12-04T09:26:54.1416570Z Running test_cpp_extensions_stream_and_event 1/1 ... [2025-12-04 09:26:54.141374][2057.805193427] 2025-12-04T09:26:54.1417287Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:26:54.1420455Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_cpp_extensions_stream_and_event.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:26:54.141711] 2025-12-04T09:26:56.9242002Z 2025-12-04T09:26:56.9243600Z test_cpp_extensions_stream_and_event 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_cpp_extensions_stream_and_event_1.1_71f2127b80ebb7e0_.log 2025-12-04T09:26:56.9245186Z Running 0 items in this shard: 2025-12-04T09:26:56.9245511Z 2025-12-04T09:26:56.9246092Z Finished test_cpp_extensions_stream_and_event 1/1 ... [2025-12-04 09:26:56.923958][2060.587774274], took 0.05min 2025-12-04T09:26:56.9295896Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_cpp_extensions_stream_and_event/test_cpp_extensions_stream_and_event-08f3a0dfb2065ef1.xml 2025-12-04T09:26:56.9644297Z Running test_dispatch 1/1 ... [2025-12-04 09:26:56.964150][2060.627968984] 2025-12-04T09:26:56.9644938Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:26:56.9648526Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_dispatch.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:26:56.964496] 2025-12-04T09:26:59.7850037Z 2025-12-04T09:26:59.7850912Z test_dispatch 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_dispatch_1.1_618c0ce137115f87_.log 2025-12-04T09:26:59.7851954Z Running 0 items in this shard: 2025-12-04T09:26:59.7852156Z 2025-12-04T09:26:59.7852542Z Finished test_dispatch 1/1 ... [2025-12-04 09:26:59.784766][2063.44858262], took 0.05min 2025-12-04T09:26:59.7903018Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_dispatch/test_dispatch-0e2e8df624d84805.xml 2025-12-04T09:26:59.8217911Z Running nn/test_convolution 1/1 ... [2025-12-04 09:26:59.821546][2063.485365241] 2025-12-04T09:26:59.8218626Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:26:59.8221344Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'nn/test_convolution.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:26:59.821853] 2025-12-04T09:27:04.3258140Z 2025-12-04T09:27:04.3259327Z nn/test_convolution 1/1 was successful, full logs can be found in artifacts with path test/test-reports/nn.test_convolution_1.1_6bed80e83e42a622_.log 2025-12-04T09:27:04.3274002Z Running 50 items in this shard: test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda 2025-12-04T09:27:04.3287967Z 2025-12-04T09:27:04.3288169Z Finished nn/test_convolution 1/1 ... [2025-12-04 09:27:04.325573][2067.989389791], took 0.08min 2025-12-04T09:27:04.3311614Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/nn.test_convolution/nn.test_convolution-80a8ed73694cb0b2.xml 2025-12-04T09:27:04.3686480Z Running test_cpp_extensions_jit 1/1 ... [2025-12-04 09:27:04.368400][2068.032219227] 2025-12-04T09:27:04.3687018Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:27:04.3690124Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_cpp_extensions_jit.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:27:04.368712] 2025-12-04T09:27:07.6702293Z 2025-12-04T09:27:07.6703199Z test_cpp_extensions_jit 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_cpp_extensions_jit_1.1_7536298b4447f40e_.log 2025-12-04T09:27:07.6703964Z Running 0 items in this shard: 2025-12-04T09:27:07.6704153Z 2025-12-04T09:27:07.6704450Z Finished test_cpp_extensions_jit 1/1 ... [2025-12-04 09:27:07.669963][2071.333779418], took 0.06min 2025-12-04T09:27:07.6755832Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_cpp_extensions_jit/test_cpp_extensions_jit-9b4bc2024c539956.xml 2025-12-04T09:27:07.7014731Z Running test_nn 1/1 ... [2025-12-04 09:27:07.701244][2071.365063471] 2025-12-04T09:27:07.7015156Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:27:07.7018648Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_nn.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:27:07.701553] 2025-12-04T09:27:15.0359257Z 2025-12-04T09:27:15.0360015Z test_nn 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_nn_1.1_23aeac1c3d14f337_.log 2025-12-04T09:27:15.0387828Z Running 150 items in this shard: test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32 2025-12-04T09:27:15.0412827Z 2025-12-04T09:27:15.0412986Z Finished test_nn 1/1 ... [2025-12-04 09:27:15.035776][2078.699592484], took 0.12min 2025-12-04T09:27:15.0416987Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_nn/test_nn-bad95e310b75086e.xml 2025-12-04T09:27:15.1356834Z Running test_multiprocessing_spawn 1/1 ... [2025-12-04 09:27:15.135448][2078.799267114] 2025-12-04T09:27:15.1357335Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:27:15.1360512Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_multiprocessing_spawn.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:27:15.135772] 2025-12-04T09:27:22.9416213Z 2025-12-04T09:27:22.9417006Z test_multiprocessing_spawn 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_multiprocessing_spawn_1.1_7b3cc8d6efb8b7a0_.log 2025-12-04T09:27:22.9417611Z Running 0 items in this shard: 2025-12-04T09:27:22.9417748Z 2025-12-04T09:27:22.9417969Z Finished test_multiprocessing_spawn 1/1 ... [2025-12-04 09:27:22.941416][2086.605230675], took 0.13min 2025-12-04T09:27:22.9483973Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_multiprocessing_spawn/test_multiprocessing_spawn-10678c829390d154.xml 2025-12-04T09:27:23.0248013Z Running nn/test_pooling 1/1 ... [2025-12-04 09:27:23.024536][2086.688355573] 2025-12-04T09:27:23.0248580Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:27:23.0251534Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'nn/test_pooling.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:27:23.024860] 2025-12-04T09:27:26.4777581Z 2025-12-04T09:27:26.4778921Z nn/test_pooling 1/1 was successful, full logs can be found in artifacts with path test/test-reports/nn.test_pooling_1.1_f8fdb3392a1afe99_.log 2025-12-04T09:27:26.4779759Z Running 0 items in this shard: 2025-12-04T09:27:26.4780102Z 2025-12-04T09:27:26.4780379Z Finished nn/test_pooling 1/1 ... [2025-12-04 09:27:26.477531][2090.141346981], took 0.06min 2025-12-04T09:27:26.4845676Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/nn.test_pooling/nn.test_pooling-81f3ed8b48e81d4f.xml 2025-12-04T09:27:26.5099001Z Running test_cuda_trace 1/1 ... [2025-12-04 09:27:26.509650][2090.173468238] 2025-12-04T09:27:26.5099533Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:27:26.5102915Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_cuda_trace.py', '--shard-id=1', '--num-shards=1', '-v', '--subprocess', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:27:26.509974] 2025-12-04T09:27:29.2740643Z 2025-12-04T09:27:29.2741504Z test_cuda_trace 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_cuda_trace_1.1_dd993c3a642ff88e_.log 2025-12-04T09:27:29.2742337Z Running 0 items in this shard: 2025-12-04T09:27:29.2742547Z 2025-12-04T09:27:29.2743093Z Finished test_cuda_trace 1/1 ... [2025-12-04 09:27:29.273828][2092.937643126], took 0.05min 2025-12-04T09:27:29.2811802Z Running test_native_mha 1/1 ... [2025-12-04 09:27:29.280967][2092.944786392] 2025-12-04T09:27:29.2812390Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:27:29.2815819Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_native_mha.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:27:29.281297] 2025-12-04T09:27:32.5763509Z 2025-12-04T09:27:32.5764406Z test_native_mha 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_native_mha_1.1_da5e8954db1fb20b_.log 2025-12-04T09:27:32.5765178Z Running 0 items in this shard: 2025-12-04T09:27:32.5765378Z 2025-12-04T09:27:32.5765637Z Finished test_native_mha 1/1 ... [2025-12-04 09:27:32.576092][2096.239907305], took 0.05min 2025-12-04T09:27:32.5834711Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_native_mha/test_native_mha-93061de1bb56c8a8.xml 2025-12-04T09:27:32.6646436Z Running test_cuda_nvml_based_avail 1/1 ... [2025-12-04 09:27:32.664390][2096.328209602] 2025-12-04T09:27:32.6647033Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:27:32.6650338Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_cuda_nvml_based_avail.py', '--shard-id=1', '--num-shards=1', '-v', '--subprocess', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:27:32.664720] 2025-12-04T09:27:35.4419993Z 2025-12-04T09:27:35.4421014Z test_cuda_nvml_based_avail 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_cuda_nvml_based_avail_1.1_b3fe187c64b13f60_.log 2025-12-04T09:27:35.4421933Z Running 0 items in this shard: 2025-12-04T09:27:35.4422140Z 2025-12-04T09:27:35.4422472Z Finished test_cuda_nvml_based_avail 1/1 ... [2025-12-04 09:27:35.441779][2099.105593934], took 0.05min 2025-12-04T09:27:35.4483564Z Running test_mobile_optimizer 1/1 ... [2025-12-04 09:27:35.448116][2099.111934895] 2025-12-04T09:27:35.4484088Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:27:35.4487716Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_mobile_optimizer.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:27:35.448462] 2025-12-04T09:27:38.8068280Z 2025-12-04T09:27:38.8069235Z test_mobile_optimizer 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_mobile_optimizer_1.1_6bfbdcc4fadadd51_.log 2025-12-04T09:27:38.8097829Z Running 100 items in this shard: test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures 2025-12-04T09:27:38.8126800Z 2025-12-04T09:27:38.8127135Z Finished test_mobile_optimizer 1/1 ... [2025-12-04 09:27:38.806806][2102.47062048], took 0.06min 2025-12-04T09:27:38.8130615Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_mobile_optimizer/test_mobile_optimizer-f5ddc3c2ec167b4c.xml 2025-12-04T09:27:38.8947620Z Running test_overrides 1/1 ... [2025-12-04 09:27:38.894504][2102.558322847] 2025-12-04T09:27:38.8948115Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:27:38.8951181Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_overrides.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:27:38.894822] 2025-12-04T09:27:44.5582170Z 2025-12-04T09:27:44.5582989Z test_overrides 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_overrides_1.1_cccd5c265cf6e4d2_.log 2025-12-04T09:27:44.5635348Z Running 250 items in this shard: test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api 2025-12-04T09:27:44.5685423Z 2025-12-04T09:27:44.5685618Z Finished test_overrides 1/1 ... [2025-12-04 09:27:44.558345][2108.222160585], took 0.09min 2025-12-04T09:27:44.5686276Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_overrides/test_overrides-52ce9f408887fb01.xml 2025-12-04T09:27:44.5981336Z Running test_cuda_primary_ctx 1/1 ... [2025-12-04 09:27:44.597893][2108.261711582] 2025-12-04T09:27:44.5981754Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:27:44.5984958Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_cuda_primary_ctx.py', '--shard-id=1', '--num-shards=1', '-v', '--subprocess', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:27:44.598219] 2025-12-04T09:27:47.6796900Z 2025-12-04T09:27:47.6798037Z test_cuda_primary_ctx 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_cuda_primary_ctx_1.1_41fde96bae32b26e_.log 2025-12-04T09:27:47.6798813Z Running 0 items in this shard: 2025-12-04T09:27:47.6798991Z 2025-12-04T09:27:47.6799373Z Finished test_cuda_primary_ctx 1/1 ... [2025-12-04 09:27:47.679449][2111.343265924], took 0.05min 2025-12-04T09:27:47.6863152Z Running test_reductions 1/1 ... [2025-12-04 09:27:47.686079][2111.349898381] 2025-12-04T09:27:47.6863578Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:27:47.6867042Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_reductions.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:27:47.686447] 2025-12-04T09:27:58.5106690Z 2025-12-04T09:27:58.5107612Z test_reductions 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_reductions_1.1_9ab6c8b8ff1ac059_.log 2025-12-04T09:27:58.5108270Z Running 0 items in this shard: 2025-12-04T09:27:58.5108442Z 2025-12-04T09:27:58.5108668Z Finished test_reductions 1/1 ... [2025-12-04 09:27:58.510444][2122.174259387], took 0.18min 2025-12-04T09:27:58.5170413Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_reductions/test_reductions-1dbe12b386b7a769.xml 2025-12-04T09:27:58.6020596Z Running test_spectral_ops 1/1 ... [2025-12-04 09:27:58.601745][2122.265563156] 2025-12-04T09:27:58.6021046Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:27:58.6023195Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_spectral_ops.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:27:58.602062] 2025-12-04T09:28:03.3148447Z 2025-12-04T09:28:03.3149354Z test_spectral_ops 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_spectral_ops_1.1_1bcbe1253863dd88_.log 2025-12-04T09:28:03.3150072Z Running 0 items in this shard: 2025-12-04T09:28:03.3150251Z 2025-12-04T09:28:03.3150508Z Finished test_spectral_ops 1/1 ... [2025-12-04 09:28:03.314581][2126.978395363], took 0.08min 2025-12-04T09:28:03.3224497Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_spectral_ops/test_spectral_ops-864e0cd8d7f784b9.xml 2025-12-04T09:28:03.3494861Z Running distributions/test_distributions 1/1 ... [2025-12-04 09:28:03.349242][2127.013061224] 2025-12-04T09:28:03.3495415Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:28:03.3498613Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'distributions/test_distributions.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:28:03.349561] 2025-12-04T09:28:07.1868170Z 2025-12-04T09:28:07.1868949Z distributions/test_distributions 1/1 was successful, full logs can be found in artifacts with path test/test-reports/distributions.test_distributions_1.1_5253b86f7af9d941_.log 2025-12-04T09:28:07.1882069Z Running 50 items in this shard: test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample 2025-12-04T09:28:07.1894576Z 2025-12-04T09:28:07.1894833Z Finished distributions/test_distributions 1/1 ... [2025-12-04 09:28:07.186706][2130.850521567], took 0.06min 2025-12-04T09:28:07.1943732Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/distributions.test_distributions/distributions.test_distributions-26c49ab0bf9a2a75.xml 2025-12-04T09:28:07.2245289Z Running test_autoload_disable 1/1 ... [2025-12-04 09:28:07.224313][2130.88813163] 2025-12-04T09:28:07.5108691Z Processing /var/lib/jenkins/workspace/test/cpp_extensions 2025-12-04T09:28:10.2845851Z Preparing metadata (pyproject.toml) ... [?25l- done 2025-12-04T09:28:10.2868136Z [?25hBuilding wheels for collected packages: torch_test_cpp_extension 2025-12-04T09:29:24.6315483Z Building wheel for torch_test_cpp_extension (pyproject.toml) ... [?25l- \ | / - \ | / - \ | / - \ | / - \ | / - done 2025-12-04T09:29:24.6427683Z [?25h Created wheel for torch_test_cpp_extension: filename=torch_test_cpp_extension-0.0.0-cp310-cp310-linux_x86_64.whl size=13199649 sha256=680c48568d4df629de84ff5b48a9e14ebd8c69b92b6fde3800c9d46feacf45fd 2025-12-04T09:29:24.6429029Z Stored in directory: /tmp/pip-ephem-wheel-cache-t286r6i4/wheels/2b/79/8d/635cf291e138cfea331292ca746c62b61fade208eb55a7e3a1 2025-12-04T09:29:24.6448792Z Successfully built torch_test_cpp_extension 2025-12-04T09:29:24.9534738Z Installing collected packages: torch_test_cpp_extension 2025-12-04T09:29:25.1255925Z Successfully installed torch_test_cpp_extension-0.0.0 2025-12-04T09:29:27.4129939Z 2025-12-04T09:29:27.4130375Z Running tests... 2025-12-04T09:29:27.4130784Z ---------------------------------------------------------------------- 2025-12-04T09:29:27.6887839Z . 2025-12-04T09:29:27.6888244Z ---------------------------------------------------------------------- 2025-12-04T09:29:27.6888731Z Ran 1 test in 0.276s 2025-12-04T09:29:27.6888918Z 2025-12-04T09:29:27.6889012Z OK 2025-12-04T09:29:27.6889143Z 2025-12-04T09:29:27.6889266Z Generating XML reports... 2025-12-04T09:29:28.2856486Z Finished test_autoload_disable 1/1 ... [2025-12-04 09:29:28.285147][2211.948954921], took 1.35min 2025-12-04T09:29:28.2925856Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-unittest/test_autoload/TEST-TestDeviceBackendAutoload-20251204092927.xml 2025-12-04T09:29:28.3946574Z Running test_autoload_enable 1/1 ... [2025-12-04 09:29:28.394397][2212.058215469] 2025-12-04T09:29:28.6945687Z Processing /var/lib/jenkins/workspace/test/cpp_extensions 2025-12-04T09:29:31.4503707Z Preparing metadata (pyproject.toml) ... [?25l- done 2025-12-04T09:29:31.4524899Z [?25hBuilding wheels for collected packages: torch_test_cpp_extension 2025-12-04T09:30:44.6818648Z Building wheel for torch_test_cpp_extension (pyproject.toml) ... [?25l- \ | / - \ | / - \ | / - \ | / - \ | / - done 2025-12-04T09:30:44.6929700Z [?25h Created wheel for torch_test_cpp_extension: filename=torch_test_cpp_extension-0.0.0-cp310-cp310-linux_x86_64.whl size=13199649 sha256=1943cbe909777b05cd8d1ed8d7e3a1b249ed0cefc29fa951b8ff46a93c22c54b 2025-12-04T09:30:44.6931241Z Stored in directory: /tmp/pip-ephem-wheel-cache-yqxauekq/wheels/2b/79/8d/635cf291e138cfea331292ca746c62b61fade208eb55a7e3a1 2025-12-04T09:30:44.6948997Z Successfully built torch_test_cpp_extension 2025-12-04T09:30:44.9972512Z Installing collected packages: torch_test_cpp_extension 2025-12-04T09:30:45.1806216Z Successfully installed torch_test_cpp_extension-0.0.0 2025-12-04T09:30:47.4774975Z 2025-12-04T09:30:47.4775493Z Running tests... 2025-12-04T09:30:47.4775884Z ---------------------------------------------------------------------- 2025-12-04T09:30:47.7522014Z . 2025-12-04T09:30:47.7522396Z ---------------------------------------------------------------------- 2025-12-04T09:30:47.7522837Z Ran 1 test in 0.275s 2025-12-04T09:30:47.7523015Z 2025-12-04T09:30:47.7523113Z OK 2025-12-04T09:30:47.7523231Z 2025-12-04T09:30:47.7523352Z Generating XML reports... 2025-12-04T09:30:48.3490511Z Finished test_autoload_enable 1/1 ... [2025-12-04 09:30:48.348589][2292.012395849], took 1.33min 2025-12-04T09:30:48.3561336Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-unittest/test_autoload/TEST-TestDeviceBackendAutoload-20251204093047.xml 2025-12-04T09:30:48.4363217Z Running test_cpp_extensions_aot_ninja 1/1 ... [2025-12-04 09:30:48.436036][2292.099853842] 2025-12-04T09:30:48.7633939Z Processing /var/lib/jenkins/workspace/test/cpp_extensions 2025-12-04T09:30:51.5619660Z Preparing metadata (pyproject.toml) ... [?25l- done 2025-12-04T09:30:51.5640815Z [?25hBuilding wheels for collected packages: torch_test_cpp_extension 2025-12-04T09:32:06.2610667Z Building wheel for torch_test_cpp_extension (pyproject.toml) ... [?25l- \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ done 2025-12-04T09:32:06.2742137Z [?25h Created wheel for torch_test_cpp_extension: filename=torch_test_cpp_extension-0.0.0-cp310-cp310-linux_x86_64.whl size=16081740 sha256=0f7f25e35f445ddb6965a84fe7b414033a47f95e3b645039eab52bb93744273a 2025-12-04T09:32:06.2744551Z Stored in directory: /tmp/pip-ephem-wheel-cache-_4ggd69_/wheels/2b/79/8d/635cf291e138cfea331292ca746c62b61fade208eb55a7e3a1 2025-12-04T09:32:06.2761080Z Successfully built torch_test_cpp_extension 2025-12-04T09:32:06.5816827Z Installing collected packages: torch_test_cpp_extension 2025-12-04T09:32:06.7979566Z Successfully installed torch_test_cpp_extension-0.0.0 2025-12-04T09:32:07.1312479Z Processing /var/lib/jenkins/workspace/test/cpp_extensions/no_python_abi_suffix_test 2025-12-04T09:32:08.7443999Z Preparing metadata (pyproject.toml) ... [?25l- done 2025-12-04T09:32:08.7466048Z [?25hBuilding wheels for collected packages: no_python_abi_suffix_test 2025-12-04T09:32:10.6787681Z Building wheel for no_python_abi_suffix_test (pyproject.toml) ... [?25l- \ | done 2025-12-04T09:32:10.6796929Z [?25h Created wheel for no_python_abi_suffix_test: filename=no_python_abi_suffix_test-0.0.0-cp310-cp310-linux_x86_64.whl size=2944 sha256=7df2d220d15002f044168a7622c666924036dc315d4d38b2fda6ff316d3b7840 2025-12-04T09:32:10.6798536Z Stored in directory: /tmp/pip-ephem-wheel-cache-7zbyhs_y/wheels/8c/c7/11/bcf2bfbdebb3cf78b8211ac54acc945a8fdf1732548d147a80 2025-12-04T09:32:10.6816645Z Successfully built no_python_abi_suffix_test 2025-12-04T09:32:10.9829708Z Installing collected packages: no_python_abi_suffix_test 2025-12-04T09:32:10.9930615Z Successfully installed no_python_abi_suffix_test-0.0.0 2025-12-04T09:32:11.0821016Z * Getting build dependencies for wheel... 2025-12-04T09:32:12.3926232Z running egg_info 2025-12-04T09:32:12.3997585Z creating python_agnostic.egg-info 2025-12-04T09:32:12.3998718Z writing python_agnostic.egg-info/PKG-INFO 2025-12-04T09:32:12.4003054Z writing dependency_links to python_agnostic.egg-info/dependency_links.txt 2025-12-04T09:32:12.4005731Z writing top-level names to python_agnostic.egg-info/top_level.txt 2025-12-04T09:32:12.4006930Z writing manifest file 'python_agnostic.egg-info/SOURCES.txt' 2025-12-04T09:32:12.4419657Z reading manifest file 'python_agnostic.egg-info/SOURCES.txt' 2025-12-04T09:32:12.4426752Z writing manifest file 'python_agnostic.egg-info/SOURCES.txt' 2025-12-04T09:32:12.7400941Z * Building wheel... 2025-12-04T09:32:14.0286003Z running bdist_wheel 2025-12-04T09:32:14.0827711Z running build 2025-12-04T09:32:14.0828021Z running build_ext 2025-12-04T09:32:14.0861283Z building 'python_agnostic._C' extension 2025-12-04T09:32:14.0864994Z creating /var/lib/jenkins/workspace/test/cpp_extensions/python_agnostic_extension/build/temp.linux-x86_64-cpython-310/var/lib/jenkins/workspace/test/cpp_extensions/python_agnostic_extension/python_agnostic/csrc 2025-12-04T09:32:24.4853606Z [1/1] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /var/lib/jenkins/workspace/test/cpp_extensions/python_agnostic_extension/build/temp.linux-x86_64-cpython-310/var/lib/jenkins/workspace/test/cpp_extensions/python_agnostic_extension/python_agnostic/csrc/ultra_norm.o.d -I/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/include -I/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/opt/conda/envs/py_3.10/include/python3.10 -c -c /var/lib/jenkins/workspace/test/cpp_extensions/python_agnostic_extension/python_agnostic/csrc/ultra_norm.cu -o /var/lib/jenkins/workspace/test/cpp_extensions/python_agnostic_extension/build/temp.linux-x86_64-cpython-310/var/lib/jenkins/workspace/test/cpp_extensions/python_agnostic_extension/python_agnostic/csrc/ultra_norm.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x030A0000 -DTORCH_EXTENSION_NAME=_C -gencode=arch=compute_89,code=sm_89 -std=c++17 2025-12-04T09:32:24.4920154Z creating build/lib.linux-x86_64-cpython-310/python_agnostic 2025-12-04T09:32:24.4924522Z g++ -pthread -B /opt/conda/envs/py_3.10/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /opt/conda/envs/py_3.10/include -fPIC -O2 -isystem /opt/conda/envs/py_3.10/include -pthread -B /opt/conda/envs/py_3.10/compiler_compat -shared /var/lib/jenkins/workspace/test/cpp_extensions/python_agnostic_extension/build/temp.linux-x86_64-cpython-310/var/lib/jenkins/workspace/test/cpp_extensions/python_agnostic_extension/python_agnostic/csrc/ultra_norm.o -L/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib -L/usr/local/cuda/lib64 -lc10 -ltorch -ltorch_cpu -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-x86_64-cpython-310/python_agnostic/_C.so 2025-12-04T09:32:24.8255035Z installing to build/bdist.linux-x86_64/wheel 2025-12-04T09:32:24.8255457Z running install 2025-12-04T09:32:24.8294120Z running install_lib 2025-12-04T09:32:24.8361085Z creating build/bdist.linux-x86_64/wheel 2025-12-04T09:32:24.8363864Z creating build/bdist.linux-x86_64/wheel/python_agnostic 2025-12-04T09:32:24.8365179Z copying build/lib.linux-x86_64-cpython-310/python_agnostic/_C.so -> build/bdist.linux-x86_64/wheel/./python_agnostic 2025-12-04T09:32:24.8370786Z running install_egg_info 2025-12-04T09:32:24.8438627Z running egg_info 2025-12-04T09:32:24.8500609Z writing python_agnostic.egg-info/PKG-INFO 2025-12-04T09:32:24.8504645Z writing dependency_links to python_agnostic.egg-info/dependency_links.txt 2025-12-04T09:32:24.8514695Z writing top-level names to python_agnostic.egg-info/top_level.txt 2025-12-04T09:32:24.8588339Z reading manifest file 'python_agnostic.egg-info/SOURCES.txt' 2025-12-04T09:32:24.8596783Z writing manifest file 'python_agnostic.egg-info/SOURCES.txt' 2025-12-04T09:32:24.8604324Z Copying python_agnostic.egg-info to build/bdist.linux-x86_64/wheel/./python_agnostic-0.0-py3.10.egg-info 2025-12-04T09:32:24.8611091Z running install_scripts 2025-12-04T09:32:24.8719840Z creating build/bdist.linux-x86_64/wheel/python_agnostic-0.0.dist-info/WHEEL 2025-12-04T09:32:24.8724537Z creating '/var/lib/jenkins/workspace/test/cpp_extensions/python_agnostic_extension/dist/.tmp-f46q_09j/python_agnostic-0.0-cp39-abi3-linux_x86_64.whl' and adding 'build/bdist.linux-x86_64/wheel' to it 2025-12-04T09:32:24.8880924Z adding 'python_agnostic/_C.so' 2025-12-04T09:32:24.8886505Z adding 'python_agnostic-0.0.dist-info/METADATA' 2025-12-04T09:32:24.8887585Z adding 'python_agnostic-0.0.dist-info/WHEEL' 2025-12-04T09:32:24.8888725Z adding 'python_agnostic-0.0.dist-info/top_level.txt' 2025-12-04T09:32:24.8890127Z adding 'python_agnostic-0.0.dist-info/RECORD' 2025-12-04T09:32:24.8890802Z removing build/bdist.linux-x86_64/wheel 2025-12-04T09:32:25.1547869Z Successfully built python_agnostic-0.0-cp39-abi3-linux_x86_64.whl 2025-12-04T09:32:25.4547471Z Processing /var/lib/jenkins/workspace/test/cpp_extensions/libtorch_agnostic_2_9_extension 2025-12-04T09:32:27.0914413Z Preparing metadata (pyproject.toml) ... [?25l- done 2025-12-04T09:32:27.0937776Z [?25hRequirement already satisfied: torch in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from libtorch_agnostic_2_9==0.0) (2.10.0a0+gitffd9b0f) 2025-12-04T09:32:27.0960997Z Requirement already satisfied: filelock in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch->libtorch_agnostic_2_9==0.0) (3.18.0) 2025-12-04T09:32:27.0966081Z Requirement already satisfied: typing-extensions>=4.10.0 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch->libtorch_agnostic_2_9==0.0) (4.12.2) 2025-12-04T09:32:27.0970438Z Requirement already satisfied: sympy>=1.13.3 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch->libtorch_agnostic_2_9==0.0) (1.13.3) 2025-12-04T09:32:27.0974697Z Requirement already satisfied: networkx>=2.5.1 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch->libtorch_agnostic_2_9==0.0) (2.8.8) 2025-12-04T09:32:27.0977820Z Requirement already satisfied: jinja2 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch->libtorch_agnostic_2_9==0.0) (3.1.6) 2025-12-04T09:32:27.0982077Z Requirement already satisfied: fsspec>=0.8.5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch->libtorch_agnostic_2_9==0.0) (2025.10.0) 2025-12-04T09:32:27.1308270Z Requirement already satisfied: mpmath<1.4,>=1.1.0 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from sympy>=1.13.3->torch->libtorch_agnostic_2_9==0.0) (1.3.0) 2025-12-04T09:32:27.1356024Z Requirement already satisfied: MarkupSafe>=2.0 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from jinja2->torch->libtorch_agnostic_2_9==0.0) (3.0.3) 2025-12-04T09:32:27.1364789Z Building wheels for collected packages: libtorch_agnostic_2_9 2025-12-04T09:32:32.0691000Z Building wheel for libtorch_agnostic_2_9 (pyproject.toml) ... [?25l- \ | / done 2025-12-04T09:32:32.0699722Z [?25h Created wheel for libtorch_agnostic_2_9: filename=libtorch_agnostic_2_9-0.0-cp39-abi3-linux_x86_64.whl size=55939 sha256=23c8af5fa1a6afdf7779e74d9a9fd7afbcfe48609e25cbdc4a27d50b9425e938 2025-12-04T09:32:32.0701075Z Stored in directory: /tmp/pip-ephem-wheel-cache-yhk00zgo/wheels/e1/56/0d/91ac1e918c8015b48f6a77f66abeeb8427a8788f7d37715e0e 2025-12-04T09:32:32.0719215Z Successfully built libtorch_agnostic_2_9 2025-12-04T09:32:32.3383384Z Installing collected packages: libtorch_agnostic_2_9 2025-12-04T09:32:32.3496429Z Successfully installed libtorch_agnostic_2_9-0.0 2025-12-04T09:32:32.6739548Z Processing /var/lib/jenkins/workspace/test/cpp_extensions/libtorch_agnostic_2_10_extension 2025-12-04T09:32:34.3182432Z Preparing metadata (pyproject.toml) ... [?25l- done 2025-12-04T09:32:34.3205172Z [?25hRequirement already satisfied: torch in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from libtorch_agnostic_2_10==0.0) (2.10.0a0+gitffd9b0f) 2025-12-04T09:32:34.3228287Z Requirement already satisfied: filelock in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch->libtorch_agnostic_2_10==0.0) (3.18.0) 2025-12-04T09:32:34.3232914Z Requirement already satisfied: typing-extensions>=4.10.0 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch->libtorch_agnostic_2_10==0.0) (4.12.2) 2025-12-04T09:32:34.3237203Z Requirement already satisfied: sympy>=1.13.3 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch->libtorch_agnostic_2_10==0.0) (1.13.3) 2025-12-04T09:32:34.3241540Z Requirement already satisfied: networkx>=2.5.1 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch->libtorch_agnostic_2_10==0.0) (2.8.8) 2025-12-04T09:32:34.3244790Z Requirement already satisfied: jinja2 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch->libtorch_agnostic_2_10==0.0) (3.1.6) 2025-12-04T09:32:34.3248985Z Requirement already satisfied: fsspec>=0.8.5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch->libtorch_agnostic_2_10==0.0) (2025.10.0) 2025-12-04T09:32:34.3568734Z Requirement already satisfied: mpmath<1.4,>=1.1.0 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from sympy>=1.13.3->torch->libtorch_agnostic_2_10==0.0) (1.3.0) 2025-12-04T09:32:34.3615929Z Requirement already satisfied: MarkupSafe>=2.0 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from jinja2->torch->libtorch_agnostic_2_10==0.0) (3.0.3) 2025-12-04T09:32:34.3624611Z Building wheels for collected packages: libtorch_agnostic_2_10 2025-12-04T09:32:39.6868060Z Building wheel for libtorch_agnostic_2_10 (pyproject.toml) ... [?25l- \ | / - \ done 2025-12-04T09:32:39.6878103Z [?25h Created wheel for libtorch_agnostic_2_10: filename=libtorch_agnostic_2_10-0.0-cp39-abi3-linux_x86_64.whl size=83384 sha256=37aa4c61e845e22da1163f9de0e0fff13d8b10a393da8f14939b88248ac0d773 2025-12-04T09:32:39.6879706Z Stored in directory: /tmp/pip-ephem-wheel-cache-vzstvz5v/wheels/03/17/c4/d9b9dbd12b271a9a317a75e944d0966701385d67eac86f2c1a 2025-12-04T09:32:39.6897491Z Successfully built libtorch_agnostic_2_10 2025-12-04T09:32:39.9565508Z Installing collected packages: libtorch_agnostic_2_10 2025-12-04T09:32:39.9738935Z Successfully installed libtorch_agnostic_2_10-0.0 2025-12-04T09:32:40.0178015Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:32:40.0182038Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_cpp_extensions_aot_ninja.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:32:40.017870] 2025-12-04T09:32:43.1512855Z 2025-12-04T09:32:43.1513943Z test_cpp_extensions_aot_ninja 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_cpp_extensions_aot_ninja_1.1_56e7b64c9caaf566_.log 2025-12-04T09:32:43.1514863Z Running 0 items in this shard: 2025-12-04T09:32:43.1515036Z 2025-12-04T09:32:43.1515329Z Finished test_cpp_extensions_aot_ninja 1/1 ... [2025-12-04 09:32:43.151184][2406.81499992], took 1.91min 2025-12-04T09:32:43.1585272Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_cpp_extensions_aot_ninja/test_cpp_extensions_aot_ninja-da282566f73e8ede.xml 2025-12-04T09:32:43.2644108Z Running test_cpp_extensions_aot_no_ninja 1/1 ... [2025-12-04 09:32:43.264156][2406.927975397] 2025-12-04T09:32:43.5654430Z Processing /var/lib/jenkins/workspace/test/cpp_extensions 2025-12-04T09:32:46.3281951Z Preparing metadata (pyproject.toml) ... [?25l- done 2025-12-04T09:32:46.3303881Z [?25hBuilding wheels for collected packages: torch_test_cpp_extension 2025-12-04T09:33:59.4674917Z Building wheel for torch_test_cpp_extension (pyproject.toml) ... [?25l- \ | / - \ | / - \ | / - \ | / - \ | / - done 2025-12-04T09:33:59.4785298Z [?25h Created wheel for torch_test_cpp_extension: filename=torch_test_cpp_extension-0.0.0-cp310-cp310-linux_x86_64.whl size=13199649 sha256=e2e7d3280cb08ca251fbc4f8af3e7430aba668520bac9eb2ecaa8c471d747148 2025-12-04T09:33:59.4788168Z Stored in directory: /tmp/pip-ephem-wheel-cache-iui_q__r/wheels/2b/79/8d/635cf291e138cfea331292ca746c62b61fade208eb55a7e3a1 2025-12-04T09:33:59.4804129Z Successfully built torch_test_cpp_extension 2025-12-04T09:33:59.7847352Z Installing collected packages: torch_test_cpp_extension 2025-12-04T09:33:59.9610134Z Successfully installed torch_test_cpp_extension-0.0.0 2025-12-04T09:34:00.2936200Z Processing /var/lib/jenkins/workspace/test/cpp_extensions/no_python_abi_suffix_test 2025-12-04T09:34:01.9125912Z Preparing metadata (pyproject.toml) ... [?25l- done 2025-12-04T09:34:01.9148106Z [?25hBuilding wheels for collected packages: no_python_abi_suffix_test 2025-12-04T09:34:03.6891503Z Building wheel for no_python_abi_suffix_test (pyproject.toml) ... [?25l- \ done 2025-12-04T09:34:03.6900069Z [?25h Created wheel for no_python_abi_suffix_test: filename=no_python_abi_suffix_test-0.0.0-cp310-cp310-linux_x86_64.whl size=2944 sha256=0f360ea653cedaaf2533550c011896ed877661afc39747a09ed860ececc521cd 2025-12-04T09:34:03.6901563Z Stored in directory: /tmp/pip-ephem-wheel-cache-ujn7_og4/wheels/8c/c7/11/bcf2bfbdebb3cf78b8211ac54acc945a8fdf1732548d147a80 2025-12-04T09:34:03.6919912Z Successfully built no_python_abi_suffix_test 2025-12-04T09:34:03.9924119Z Installing collected packages: no_python_abi_suffix_test 2025-12-04T09:34:04.0021455Z Successfully installed no_python_abi_suffix_test-0.0.0 2025-12-04T09:34:04.0906454Z * Getting build dependencies for wheel... 2025-12-04T09:34:05.3837904Z running egg_info 2025-12-04T09:34:05.3910212Z writing python_agnostic.egg-info/PKG-INFO 2025-12-04T09:34:05.3914784Z writing dependency_links to python_agnostic.egg-info/dependency_links.txt 2025-12-04T09:34:05.3917747Z writing top-level names to python_agnostic.egg-info/top_level.txt 2025-12-04T09:34:05.4327903Z reading manifest file 'python_agnostic.egg-info/SOURCES.txt' 2025-12-04T09:34:05.4336336Z writing manifest file 'python_agnostic.egg-info/SOURCES.txt' 2025-12-04T09:34:05.7203812Z * Building wheel... 2025-12-04T09:34:07.0018084Z running bdist_wheel 2025-12-04T09:34:07.0559174Z running build 2025-12-04T09:34:07.0559681Z running build_ext 2025-12-04T09:34:07.0594038Z building 'python_agnostic._C' extension 2025-12-04T09:34:07.1230602Z ninja: no work to do. 2025-12-04T09:34:07.1268288Z g++ -pthread -B /opt/conda/envs/py_3.10/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /opt/conda/envs/py_3.10/include -fPIC -O2 -isystem /opt/conda/envs/py_3.10/include -pthread -B /opt/conda/envs/py_3.10/compiler_compat -shared /var/lib/jenkins/workspace/test/cpp_extensions/python_agnostic_extension/build/temp.linux-x86_64-cpython-310/var/lib/jenkins/workspace/test/cpp_extensions/python_agnostic_extension/python_agnostic/csrc/ultra_norm.o -L/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib -L/usr/local/cuda/lib64 -lc10 -ltorch -ltorch_cpu -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-x86_64-cpython-310/python_agnostic/_C.so 2025-12-04T09:34:07.4553997Z installing to build/bdist.linux-x86_64/wheel 2025-12-04T09:34:07.4554456Z running install 2025-12-04T09:34:07.4593451Z running install_lib 2025-12-04T09:34:07.4660923Z creating build/bdist.linux-x86_64/wheel 2025-12-04T09:34:07.4663253Z creating build/bdist.linux-x86_64/wheel/python_agnostic 2025-12-04T09:34:07.4664441Z copying build/lib.linux-x86_64-cpython-310/python_agnostic/_C.so -> build/bdist.linux-x86_64/wheel/./python_agnostic 2025-12-04T09:34:07.4670008Z running install_egg_info 2025-12-04T09:34:07.4737571Z running egg_info 2025-12-04T09:34:07.4800328Z writing python_agnostic.egg-info/PKG-INFO 2025-12-04T09:34:07.4804209Z writing dependency_links to python_agnostic.egg-info/dependency_links.txt 2025-12-04T09:34:07.4807173Z writing top-level names to python_agnostic.egg-info/top_level.txt 2025-12-04T09:34:07.4875115Z reading manifest file 'python_agnostic.egg-info/SOURCES.txt' 2025-12-04T09:34:07.4883395Z writing manifest file 'python_agnostic.egg-info/SOURCES.txt' 2025-12-04T09:34:07.4885244Z Copying python_agnostic.egg-info to build/bdist.linux-x86_64/wheel/./python_agnostic-0.0-py3.10.egg-info 2025-12-04T09:34:07.4892230Z running install_scripts 2025-12-04T09:34:07.5000583Z creating build/bdist.linux-x86_64/wheel/python_agnostic-0.0.dist-info/WHEEL 2025-12-04T09:34:07.5004385Z creating '/var/lib/jenkins/workspace/test/cpp_extensions/python_agnostic_extension/dist/.tmp-1czvksv3/python_agnostic-0.0-cp39-abi3-linux_x86_64.whl' and adding 'build/bdist.linux-x86_64/wheel' to it 2025-12-04T09:34:07.5158876Z adding 'python_agnostic/_C.so' 2025-12-04T09:34:07.5165384Z adding 'python_agnostic-0.0.dist-info/METADATA' 2025-12-04T09:34:07.5166409Z adding 'python_agnostic-0.0.dist-info/WHEEL' 2025-12-04T09:34:07.5167585Z adding 'python_agnostic-0.0.dist-info/top_level.txt' 2025-12-04T09:34:07.5169289Z adding 'python_agnostic-0.0.dist-info/RECORD' 2025-12-04T09:34:07.5169778Z removing build/bdist.linux-x86_64/wheel 2025-12-04T09:34:07.7684878Z Successfully built python_agnostic-0.0-cp39-abi3-linux_x86_64.whl 2025-12-04T09:34:08.0653247Z Processing /var/lib/jenkins/workspace/test/cpp_extensions/libtorch_agnostic_2_9_extension 2025-12-04T09:34:09.6967724Z Preparing metadata (pyproject.toml) ... [?25l- done 2025-12-04T09:34:09.6989075Z [?25hRequirement already satisfied: torch in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from libtorch_agnostic_2_9==0.0) (2.10.0a0+gitffd9b0f) 2025-12-04T09:34:09.7012169Z Requirement already satisfied: filelock in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch->libtorch_agnostic_2_9==0.0) (3.18.0) 2025-12-04T09:34:09.7017270Z Requirement already satisfied: typing-extensions>=4.10.0 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch->libtorch_agnostic_2_9==0.0) (4.12.2) 2025-12-04T09:34:09.7021084Z Requirement already satisfied: sympy>=1.13.3 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch->libtorch_agnostic_2_9==0.0) (1.13.3) 2025-12-04T09:34:09.7025384Z Requirement already satisfied: networkx>=2.5.1 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch->libtorch_agnostic_2_9==0.0) (2.8.8) 2025-12-04T09:34:09.7028594Z Requirement already satisfied: jinja2 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch->libtorch_agnostic_2_9==0.0) (3.1.6) 2025-12-04T09:34:09.7032774Z Requirement already satisfied: fsspec>=0.8.5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch->libtorch_agnostic_2_9==0.0) (2025.10.0) 2025-12-04T09:34:09.7362294Z Requirement already satisfied: mpmath<1.4,>=1.1.0 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from sympy>=1.13.3->torch->libtorch_agnostic_2_9==0.0) (1.3.0) 2025-12-04T09:34:09.7411295Z Requirement already satisfied: MarkupSafe>=2.0 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from jinja2->torch->libtorch_agnostic_2_9==0.0) (3.0.3) 2025-12-04T09:34:09.7420123Z Building wheels for collected packages: libtorch_agnostic_2_9 2025-12-04T09:34:11.7256988Z Building wheel for libtorch_agnostic_2_9 (pyproject.toml) ... [?25l- \ done 2025-12-04T09:34:11.7265993Z [?25h Created wheel for libtorch_agnostic_2_9: filename=libtorch_agnostic_2_9-0.0-cp39-abi3-linux_x86_64.whl size=55939 sha256=2886a911346653b26d67b826d908c0135754be1627783f28a6bb66203b7c86e5 2025-12-04T09:34:11.7267265Z Stored in directory: /tmp/pip-ephem-wheel-cache-vvi2kx1s/wheels/e1/56/0d/91ac1e918c8015b48f6a77f66abeeb8427a8788f7d37715e0e 2025-12-04T09:34:11.7285937Z Successfully built libtorch_agnostic_2_9 2025-12-04T09:34:11.9975658Z Installing collected packages: libtorch_agnostic_2_9 2025-12-04T09:34:12.0092497Z Successfully installed libtorch_agnostic_2_9-0.0 2025-12-04T09:34:12.3349685Z Processing /var/lib/jenkins/workspace/test/cpp_extensions/libtorch_agnostic_2_10_extension 2025-12-04T09:34:13.9687981Z Preparing metadata (pyproject.toml) ... [?25l- done 2025-12-04T09:34:13.9710125Z [?25hRequirement already satisfied: torch in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from libtorch_agnostic_2_10==0.0) (2.10.0a0+gitffd9b0f) 2025-12-04T09:34:13.9733173Z Requirement already satisfied: filelock in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch->libtorch_agnostic_2_10==0.0) (3.18.0) 2025-12-04T09:34:13.9737796Z Requirement already satisfied: typing-extensions>=4.10.0 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch->libtorch_agnostic_2_10==0.0) (4.12.2) 2025-12-04T09:34:13.9741905Z Requirement already satisfied: sympy>=1.13.3 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch->libtorch_agnostic_2_10==0.0) (1.13.3) 2025-12-04T09:34:13.9745942Z Requirement already satisfied: networkx>=2.5.1 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch->libtorch_agnostic_2_10==0.0) (2.8.8) 2025-12-04T09:34:13.9749079Z Requirement already satisfied: jinja2 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch->libtorch_agnostic_2_10==0.0) (3.1.6) 2025-12-04T09:34:13.9752744Z Requirement already satisfied: fsspec>=0.8.5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch->libtorch_agnostic_2_10==0.0) (2025.10.0) 2025-12-04T09:34:14.0078896Z Requirement already satisfied: mpmath<1.4,>=1.1.0 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from sympy>=1.13.3->torch->libtorch_agnostic_2_10==0.0) (1.3.0) 2025-12-04T09:34:14.0127125Z Requirement already satisfied: MarkupSafe>=2.0 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from jinja2->torch->libtorch_agnostic_2_10==0.0) (3.0.3) 2025-12-04T09:34:14.0135705Z Building wheels for collected packages: libtorch_agnostic_2_10 2025-12-04T09:34:16.0717787Z Building wheel for libtorch_agnostic_2_10 (pyproject.toml) ... [?25l- \ done 2025-12-04T09:34:16.0726561Z [?25h Created wheel for libtorch_agnostic_2_10: filename=libtorch_agnostic_2_10-0.0-cp39-abi3-linux_x86_64.whl size=83384 sha256=0faff8056dcc8db13572547420b6227a9730298fbdcb22fe5f0e94f192e1f19d 2025-12-04T09:34:16.0728009Z Stored in directory: /tmp/pip-ephem-wheel-cache-kpxg2evl/wheels/03/17/c4/d9b9dbd12b271a9a317a75e944d0966701385d67eac86f2c1a 2025-12-04T09:34:16.0746199Z Successfully built libtorch_agnostic_2_10 2025-12-04T09:34:16.3433813Z Installing collected packages: libtorch_agnostic_2_10 2025-12-04T09:34:16.3556228Z Successfully installed libtorch_agnostic_2_10-0.0 2025-12-04T09:34:16.3974278Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:34:16.3979224Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_cpp_extensions_aot_no_ninja.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:34:16.397616] 2025-12-04T09:34:19.4631414Z 2025-12-04T09:34:19.4632355Z test_cpp_extensions_aot_no_ninja 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_cpp_extensions_aot_no_ninja_1.1_7f0ee8713838d52a_.log 2025-12-04T09:34:19.4633119Z Running 0 items in this shard: 2025-12-04T09:34:19.4633289Z 2025-12-04T09:34:19.4633817Z Finished test_cpp_extensions_aot_no_ninja 1/1 ... [2025-12-04 09:34:19.463047][2503.126863211], took 1.60min 2025-12-04T09:34:19.4704564Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_cpp_extensions_aot_no_ninja/test_cpp_extensions_aot_no_ninja-6de3ce07151778a3.xml 2025-12-04T09:34:19.5419748Z Running inductor/test_collective_autotuning 1/1 ... [2025-12-04 09:34:19.541733][2503.205552211] 2025-12-04T09:34:19.5420212Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:34:19.5423212Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_collective_autotuning.py', '-m', 'serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:34:19.542055] 2025-12-04T09:34:22.2777732Z 2025-12-04T09:34:22.2778721Z inductor/test_collective_autotuning 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_collective_autotuning_1.1_b691007e7842981d_.log 2025-12-04T09:34:22.2779560Z Running 0 items in this shard: 2025-12-04T09:34:22.2779733Z 2025-12-04T09:34:22.2780062Z Finished inductor/test_collective_autotuning 1/1 ... [2025-12-04 09:34:22.277571][2505.941387522], took 0.05min 2025-12-04T09:34:22.2848718Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_collective_autotuning/inductor.test_collective_autotuning-e098322b0776271e.xml 2025-12-04T09:34:22.3159560Z Running inductor/test_cpu_cpp_wrapper 1/1 ... [2025-12-04 09:34:22.315732][2505.979550706] 2025-12-04T09:34:22.3160086Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:34:22.3163679Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_cpu_cpp_wrapper.py', '-m', 'serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:34:22.316058] 2025-12-04T09:34:28.2997007Z 2025-12-04T09:34:28.2998044Z inductor/test_cpu_cpp_wrapper 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_cpu_cpp_wrapper_1.1_a9797ad9104e521b_.log 2025-12-04T09:34:28.2998769Z 2025-12-04T09:34:28.2999089Z Finished inductor/test_cpu_cpp_wrapper 1/1 ... [2025-12-04 09:34:28.299422][2511.96323926], took 0.10min 2025-12-04T09:34:28.3074869Z Running inductor/test_autoheuristic 1/1 ... [2025-12-04 09:34:28.307227][2511.971047097] 2025-12-04T09:34:28.3075370Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:34:28.3078607Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_autoheuristic.py', '-m', 'serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:34:28.307554] 2025-12-04T09:34:33.8093865Z 2025-12-04T09:34:33.8094889Z inductor/test_autoheuristic 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_autoheuristic_1.1_56a71428e4aab3c0_.log 2025-12-04T09:34:33.8095767Z Running 0 items in this shard: 2025-12-04T09:34:33.8095960Z 2025-12-04T09:34:33.8096299Z Finished inductor/test_autoheuristic 1/1 ... [2025-12-04 09:34:33.809140][2517.472956965], took 0.09min 2025-12-04T09:34:33.8167366Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_autoheuristic/inductor.test_autoheuristic-886885d891812a32.xml 2025-12-04T09:34:33.8824747Z Running inductor/test_control_flow 5/5 ... [2025-12-04 09:34:33.882212][2517.546030535] 2025-12-04T09:34:33.8825224Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:34:33.8828623Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_control_flow.py', '-m', 'serial', '--shard-id=5', '--num-shards=5', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:34:33.882535] 2025-12-04T09:34:40.2446403Z 2025-12-04T09:34:40.2447313Z inductor/test_control_flow 5/5 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_control_flow_5.5_440a0d8507c2b373_.log 2025-12-04T09:34:40.2448069Z Running 0 items in this shard: 2025-12-04T09:34:40.2448243Z 2025-12-04T09:34:40.2448506Z Finished inductor/test_control_flow 5/5 ... [2025-12-04 09:34:40.244390][2523.9082081], took 0.11min 2025-12-04T09:34:40.2519430Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_control_flow/inductor.test_control_flow-949c20f2734be55d.xml 2025-12-04T09:34:40.3293998Z Running inductor/test_mps_basic 1/1 ... [2025-12-04 09:34:40.329129][2523.992949386] 2025-12-04T09:34:40.3294471Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:34:40.3297217Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_mps_basic.py', '-m', 'serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:34:40.329433] 2025-12-04T09:34:46.0538324Z 2025-12-04T09:34:46.0539338Z inductor/test_mps_basic 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_mps_basic_1.1_eecb82439a0d6bc2_.log 2025-12-04T09:34:46.0540161Z 2025-12-04T09:34:46.0540507Z Finished inductor/test_mps_basic 1/1 ... [2025-12-04 09:34:46.053526][2529.717344562], took 0.10min 2025-12-04T09:34:46.0615726Z Running test_ops_fwd_gradients 3/3 ... [2025-12-04 09:34:46.061333][2529.725154139] 2025-12-04T09:34:46.0616240Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:34:46.0619390Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_ops_fwd_gradients.py', '-m', 'serial', '--shard-id=3', '--num-shards=3', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:34:46.061649] 2025-12-04T09:34:54.6503919Z 2025-12-04T09:34:54.6504764Z test_ops_fwd_gradients 3/3 was successful, full logs can be found in artifacts with path test/test-reports/test_ops_fwd_gradients_3.3_a7026342df501131_.log 2025-12-04T09:34:54.6505474Z Running 0 items in this shard: 2025-12-04T09:34:54.6505666Z 2025-12-04T09:34:54.6505919Z Finished test_ops_fwd_gradients 3/3 ... [2025-12-04 09:34:54.650120][2538.313937637], took 0.14min 2025-12-04T09:34:54.6580095Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_ops_fwd_gradients/test_ops_fwd_gradients-eb1d57437e276b50.xml 2025-12-04T09:34:54.7272693Z Running test_meta 1/4 ... [2025-12-04 09:34:54.727013][2538.390830845] 2025-12-04T09:34:54.7273384Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:34:54.7276771Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_meta.py', '-m', 'serial', '--shard-id=1', '--num-shards=4', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:34:54.727350] 2025-12-04T09:35:55.3249399Z 2025-12-04T09:35:55.3250249Z test_meta 1/4 was successful, full logs can be found in artifacts with path test/test-reports/test_meta_1.4_9f7909e7b7fd2ce6_.log 2025-12-04T09:35:55.3251376Z Running 0 items in this shard: 2025-12-04T09:35:55.3251600Z 2025-12-04T09:35:55.3251888Z Finished test_meta 1/4 ... [2025-12-04 09:35:55.324664][2598.98848089], took 1.01min 2025-12-04T09:35:55.3329405Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_meta/test_meta-7177ed662acb32c0.xml 2025-12-04T09:35:55.4077575Z Running test_meta 2/4 ... [2025-12-04 09:35:55.407507][2599.071325671] 2025-12-04T09:35:55.4078046Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:35:55.4081524Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_meta.py', '-m', 'serial', '--shard-id=2', '--num-shards=4', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:35:55.407825] 2025-12-04T09:36:55.6476631Z 2025-12-04T09:36:55.6477354Z test_meta 2/4 was successful, full logs can be found in artifacts with path test/test-reports/test_meta_2.4_db9a1888eef284c9_.log 2025-12-04T09:36:55.6477991Z Running 0 items in this shard: 2025-12-04T09:36:55.6478164Z 2025-12-04T09:36:55.6478375Z Finished test_meta 2/4 ... [2025-12-04 09:36:55.647419][2659.311235533], took 1.00min 2025-12-04T09:36:55.6557418Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_meta/test_meta-183448c70bcf4d3b.xml 2025-12-04T09:36:55.7345569Z Running functorch/test_control_flow 2/4 ... [2025-12-04 09:36:55.734276][2659.398094515] 2025-12-04T09:36:55.7346261Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:36:55.7349257Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'functorch/test_control_flow.py', '-m', 'serial', '--shard-id=2', '--num-shards=4', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:36:55.734596] 2025-12-04T09:37:02.1578331Z 2025-12-04T09:37:02.1579285Z functorch/test_control_flow 2/4 was successful, full logs can be found in artifacts with path test/test-reports/functorch.test_control_flow_2.4_6e5eff350cb47256_.log 2025-12-04T09:37:02.1580175Z Running 0 items in this shard: 2025-12-04T09:37:02.1580368Z 2025-12-04T09:37:02.1580688Z Finished functorch/test_control_flow 2/4 ... [2025-12-04 09:37:02.157604][2665.821421404], took 0.11min 2025-12-04T09:37:02.1656567Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/functorch.test_control_flow/functorch.test_control_flow-61cf112623d40fd3.xml 2025-12-04T09:37:02.2346495Z Running functorch/test_control_flow 3/4 ... [2025-12-04 09:37:02.234392][2665.898211128] 2025-12-04T09:37:02.2347012Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:37:02.2350063Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'functorch/test_control_flow.py', '-m', 'serial', '--shard-id=3', '--num-shards=4', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:37:02.234707] 2025-12-04T09:37:08.6269275Z 2025-12-04T09:37:08.6270481Z functorch/test_control_flow 3/4 was successful, full logs can be found in artifacts with path test/test-reports/functorch.test_control_flow_3.4_d5eb7fcdd6d69766_.log 2025-12-04T09:37:08.6271447Z Running 0 items in this shard: 2025-12-04T09:37:08.6271626Z 2025-12-04T09:37:08.6271928Z Finished functorch/test_control_flow 3/4 ... [2025-12-04 09:37:08.626679][2672.290495682], took 0.11min 2025-12-04T09:37:08.6348324Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/functorch.test_control_flow/functorch.test_control_flow-510382c7091b27bc.xml 2025-12-04T09:37:08.7118648Z Running test_modules 1/2 ... [2025-12-04 09:37:08.711610][2672.375428498] 2025-12-04T09:37:08.7119162Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:37:08.7122401Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_modules.py', '-m', 'serial', '--shard-id=1', '--num-shards=2', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:37:08.711931] 2025-12-04T09:37:18.1233684Z 2025-12-04T09:37:18.1234396Z test_modules 1/2 was successful, full logs can be found in artifacts with path test/test-reports/test_modules_1.2_2228adb73596c4fc_.log 2025-12-04T09:37:18.1234973Z Running 0 items in this shard: 2025-12-04T09:37:18.1235119Z 2025-12-04T09:37:18.1235591Z Finished test_modules 1/2 ... [2025-12-04 09:37:18.123109][2681.786927426], took 0.16min 2025-12-04T09:37:18.1313436Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_modules/test_modules-3b26da1a0a5fa82d.xml 2025-12-04T09:37:18.2079864Z Running optim/test_lrscheduler 1/1 ... [2025-12-04 09:37:18.207696][2681.871514777] 2025-12-04T09:37:18.2080429Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:37:18.2083272Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'optim/test_lrscheduler.py', '-m', 'serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:37:18.208012] 2025-12-04T09:37:20.9338838Z 2025-12-04T09:37:20.9339791Z optim/test_lrscheduler 1/1 was successful, full logs can be found in artifacts with path test/test-reports/optim.test_lrscheduler_1.1_3074b4c960c21b1e_.log 2025-12-04T09:37:20.9340478Z 2025-12-04T09:37:20.9340775Z Finished optim/test_lrscheduler 1/1 ... [2025-12-04 09:37:20.933630][2684.597448238], took 0.05min 2025-12-04T09:37:20.9424094Z Running test_sparse_csr 3/3 ... [2025-12-04 09:37:20.942182][2684.606001169] 2025-12-04T09:37:20.9424524Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:37:20.9427936Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_sparse_csr.py', '-m', 'serial', '--shard-id=3', '--num-shards=3', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:37:20.942541] 2025-12-04T09:37:32.3749540Z 2025-12-04T09:37:32.3750298Z test_sparse_csr 3/3 was successful, full logs can be found in artifacts with path test/test-reports/test_sparse_csr_3.3_ff7431816ae40a69_.log 2025-12-04T09:37:32.3750869Z Running 0 items in this shard: 2025-12-04T09:37:32.3751010Z 2025-12-04T09:37:32.3751223Z Finished test_sparse_csr 3/3 ... [2025-12-04 09:37:32.374684][2696.038499677], took 0.19min 2025-12-04T09:37:32.3832393Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_sparse_csr/test_sparse_csr-360baa78a773af74.xml 2025-12-04T09:37:32.4586409Z Running functorch/test_ops 4/9 ... [2025-12-04 09:37:32.458395][2696.122213983] 2025-12-04T09:37:32.4588486Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:37:32.4590139Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'functorch/test_ops.py', '-m', 'serial', '--shard-id=4', '--num-shards=9', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:37:32.458710] 2025-12-04T09:37:50.9622844Z 2025-12-04T09:37:50.9623909Z functorch/test_ops 4/9 was successful, full logs can be found in artifacts with path test/test-reports/functorch.test_ops_4.9_907fae480dda49a3_.log 2025-12-04T09:37:50.9624931Z Running 0 items in this shard: 2025-12-04T09:37:50.9625177Z 2025-12-04T09:37:50.9625529Z Finished functorch/test_ops 4/9 ... [2025-12-04 09:37:50.962026][2714.625843263], took 0.31min 2025-12-04T09:37:50.9707102Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/functorch.test_ops/functorch.test_ops-ab58c0af20190c71.xml 2025-12-04T09:37:51.0489323Z Running functorch/test_ops 7/9 ... [2025-12-04 09:37:51.048690][2714.7125092] 2025-12-04T09:37:51.0490139Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:37:51.0493358Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'functorch/test_ops.py', '-m', 'serial', '--shard-id=7', '--num-shards=9', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:37:51.049009] 2025-12-04T09:38:09.5103661Z 2025-12-04T09:38:09.5105585Z functorch/test_ops 7/9 was successful, full logs can be found in artifacts with path test/test-reports/functorch.test_ops_7.9_8b11ba6dc2d21632_.log 2025-12-04T09:38:09.5107083Z Running 0 items in this shard: 2025-12-04T09:38:09.5107449Z 2025-12-04T09:38:09.5107965Z Finished functorch/test_ops 7/9 ... [2025-12-04 09:38:09.510106][2733.17392254], took 0.31min 2025-12-04T09:38:09.5188998Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/functorch.test_ops/functorch.test_ops-3d8b763c339d8b21.xml 2025-12-04T09:38:12.4707284Z Running inductor/test_collective_autotuning 1/1 ... [2025-12-04 09:38:12.470257][2736.134072703] 2025-12-04T09:38:12.4708055Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:38:12.4710333Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_collective_autotuning.py', '-m', 'not serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:38:12.470687] 2025-12-04T09:38:12.5465344Z Running inductor/test_cpu_cpp_wrapper 1/1 ... [2025-12-04 09:38:12.546069][2736.209883896] 2025-12-04T09:38:12.5465809Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:38:12.5468311Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_cpu_cpp_wrapper.py', '-m', 'not serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:38:12.546528] 2025-12-04T09:38:12.5560829Z Running inductor/test_autoheuristic 1/1 ... [2025-12-04 09:38:12.555814][2736.219630499] 2025-12-04T09:38:12.5561366Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:38:12.5565110Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_autoheuristic.py', '-m', 'not serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:38:12.556212] 2025-12-04T09:38:15.8059497Z 2025-12-04T09:38:15.8060740Z inductor/test_collective_autotuning 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_collective_autotuning_1.1_aae30afac9f5c6aa_.log 2025-12-04T09:38:15.8062100Z Running 0 items in this shard: 2025-12-04T09:38:15.8062369Z 2025-12-04T09:38:15.8062800Z Finished inductor/test_collective_autotuning 1/1 ... [2025-12-04 09:38:15.805779][2739.469594616], took 0.06min 2025-12-04T09:38:15.8141190Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_collective_autotuning/inductor.test_collective_autotuning-51799f2bc7fe1d10.xml 2025-12-04T09:38:18.7392245Z 2025-12-04T09:38:18.7393244Z inductor/test_autoheuristic 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_autoheuristic_1.1_883c8c6c354bdbf8_.log 2025-12-04T09:38:18.7393902Z Running 0 items in this shard: 2025-12-04T09:38:18.7394051Z 2025-12-04T09:38:18.7394303Z Finished inductor/test_autoheuristic 1/1 ... [2025-12-04 09:38:18.739079][2742.402893674], took 0.10min 2025-12-04T09:38:18.7474361Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_autoheuristic/inductor.test_autoheuristic-bad78e46bec01c6d.xml 2025-12-04T09:38:19.4954416Z Running inductor/test_control_flow 5/5 ... [2025-12-04 09:38:19.494952][2743.158767613] 2025-12-04T09:38:19.4955060Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:38:19.4956853Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_control_flow.py', '-m', 'not serial', '--shard-id=5', '--num-shards=5', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:38:19.495377] 2025-12-04T09:38:20.8363243Z 2025-12-04T09:38:20.8364297Z inductor/test_cpu_cpp_wrapper 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_cpu_cpp_wrapper_1.1_954f49cf43186d21_.log 2025-12-04T09:38:20.8365133Z 2025-12-04T09:38:20.8365482Z Finished inductor/test_cpu_cpp_wrapper 1/1 ... [2025-12-04 09:38:20.836116][2744.499932606], took 0.14min 2025-12-04T09:38:22.3953817Z Running inductor/test_mps_basic 1/1 ... [2025-12-04 09:38:22.394932][2746.058747053] 2025-12-04T09:38:22.3954410Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:38:22.3957259Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_mps_basic.py', '-m', 'not serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:38:22.395381] 2025-12-04T09:38:24.4414512Z Running test_ops_fwd_gradients 3/3 ... [2025-12-04 09:38:24.440972][2748.104786746] 2025-12-04T09:38:24.4415035Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:38:24.4417777Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_ops_fwd_gradients.py', '-m', 'not serial', '--shard-id=3', '--num-shards=3', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:38:24.441461] 2025-12-04T09:38:26.8459629Z 2025-12-04T09:38:26.8460797Z inductor/test_control_flow 5/5 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_control_flow_5.5_61e6431b9f594de0_.log 2025-12-04T09:38:26.8462036Z Running 0 items in this shard: 2025-12-04T09:38:26.8462598Z 2025-12-04T09:38:26.8462915Z Finished inductor/test_control_flow 5/5 ... [2025-12-04 09:38:26.845854][2750.509671993], took 0.12min 2025-12-04T09:38:26.8547473Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_control_flow/inductor.test_control_flow-fb44f09fc737f7e3.xml 2025-12-04T09:38:29.0311697Z 2025-12-04T09:38:29.0313063Z inductor/test_mps_basic 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_mps_basic_1.1_de0965e0624bbdab_.log 2025-12-04T09:38:29.0314031Z 2025-12-04T09:38:29.0314415Z Finished inductor/test_mps_basic 1/1 ... [2025-12-04 09:38:29.030969][2752.694784877], took 0.11min 2025-12-04T09:38:30.5762796Z Running test_meta 1/4 ... [2025-12-04 09:38:30.575769][2754.239584817] 2025-12-04T09:38:30.5763302Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:38:30.5765503Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_meta.py', '-m', 'not serial', '--shard-id=1', '--num-shards=4', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:38:30.576200] 2025-12-04T09:38:32.6396578Z Running test_meta 2/4 ... [2025-12-04 09:38:32.639186][2756.303002397] 2025-12-04T09:38:32.6397081Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:38:32.6399915Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_meta.py', '-m', 'not serial', '--shard-id=2', '--num-shards=4', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:38:32.639640] 2025-12-04T09:38:33.9597085Z 2025-12-04T09:38:33.9598205Z test_ops_fwd_gradients 3/3 was successful, full logs can be found in artifacts with path test/test-reports/test_ops_fwd_gradients_3.3_d155f918101e7d63_.log 2025-12-04T09:38:33.9599039Z Running 0 items in this shard: 2025-12-04T09:38:33.9599235Z 2025-12-04T09:38:33.9599517Z Finished test_ops_fwd_gradients 3/3 ... [2025-12-04 09:38:33.959571][2757.623389733], took 0.16min 2025-12-04T09:38:33.9693408Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_ops_fwd_gradients/test_ops_fwd_gradients-abc350caee946871.xml 2025-12-04T09:38:37.5638770Z Running functorch/test_control_flow 2/4 ... [2025-12-04 09:38:37.563357][2761.227171984] 2025-12-04T09:38:37.5639433Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:38:37.5641284Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'functorch/test_control_flow.py', '-m', 'not serial', '--shard-id=2', '--num-shards=4', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:38:37.563761] 2025-12-04T09:38:44.5694359Z 2025-12-04T09:38:44.5695730Z functorch/test_control_flow 2/4 was successful, full logs can be found in artifacts with path test/test-reports/functorch.test_control_flow_2.4_b939fd6985e209a4_.log 2025-12-04T09:38:44.5696767Z Running 0 items in this shard: 2025-12-04T09:38:44.5696992Z 2025-12-04T09:38:44.5697376Z Finished functorch/test_control_flow 2/4 ... [2025-12-04 09:38:44.569319][2768.233133465], took 0.12min 2025-12-04T09:38:44.5788973Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/functorch.test_control_flow/functorch.test_control_flow-4894c67067cce928.xml 2025-12-04T09:38:48.3139793Z Running functorch/test_control_flow 3/4 ... [2025-12-04 09:38:48.313497][2771.977313075] 2025-12-04T09:38:48.3140574Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:38:48.3142263Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'functorch/test_control_flow.py', '-m', 'not serial', '--shard-id=3', '--num-shards=4', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:38:48.313908] 2025-12-04T09:38:55.4821570Z 2025-12-04T09:38:55.4822777Z functorch/test_control_flow 3/4 was successful, full logs can be found in artifacts with path test/test-reports/functorch.test_control_flow_3.4_dfdac7e5b1bc5b13_.log 2025-12-04T09:38:55.4823793Z Running 0 items in this shard: 2025-12-04T09:38:55.4823996Z 2025-12-04T09:38:55.4824340Z Finished functorch/test_control_flow 3/4 ... [2025-12-04 09:38:55.482040][2779.145852279], took 0.12min 2025-12-04T09:38:55.4927152Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/functorch.test_control_flow/functorch.test_control_flow-d3ab6b26454b86ba.xml 2025-12-04T09:38:56.6804810Z Uploading artifacts took 1.10 seconds 2025-12-04T09:38:59.1864861Z Running test_modules 1/2 ... [2025-12-04 09:38:59.185971][2782.849785882] 2025-12-04T09:38:59.1865820Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:38:59.1867918Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_modules.py', '-m', 'not serial', '--shard-id=1', '--num-shards=2', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:38:59.186442] 2025-12-04T09:39:08.7923269Z 2025-12-04T09:39:08.7924344Z test_modules 1/2 was successful, full logs can be found in artifacts with path test/test-reports/test_modules_1.2_a952f73d153232c2_.log 2025-12-04T09:39:08.7925223Z Running 0 items in this shard: 2025-12-04T09:39:08.7925468Z 2025-12-04T09:39:08.7925759Z Finished test_modules 1/2 ... [2025-12-04 09:39:08.792165][2792.455981221], took 0.16min 2025-12-04T09:39:08.8021552Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_modules/test_modules-2f9f220a5fc72a18.xml 2025-12-04T09:39:12.2366984Z Running optim/test_lrscheduler 1/1 ... [2025-12-04 09:39:12.236210][2795.900025826] 2025-12-04T09:39:12.2367620Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:39:12.2369970Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'optim/test_lrscheduler.py', '-m', 'not serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:39:12.236631] 2025-12-04T09:39:15.2416613Z 2025-12-04T09:39:15.2417458Z optim/test_lrscheduler 1/1 was successful, full logs can be found in artifacts with path test/test-reports/optim.test_lrscheduler_1.1_2fa908785b6f7913_.log 2025-12-04T09:39:15.2425823Z 2025-12-04T09:39:15.2426158Z Finished optim/test_lrscheduler 1/1 ... [2025-12-04 09:39:15.241499][2798.905316021], took 0.05min 2025-12-04T09:39:18.6489254Z Running test_sparse_csr 3/3 ... [2025-12-04 09:39:18.648433][2802.312249472] 2025-12-04T09:39:18.6489817Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:39:18.6492221Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_sparse_csr.py', '-m', 'not serial', '--shard-id=3', '--num-shards=3', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:39:18.648875] 2025-12-04T09:39:30.6765984Z 2025-12-04T09:39:30.6767143Z test_sparse_csr 3/3 was successful, full logs can be found in artifacts with path test/test-reports/test_sparse_csr_3.3_16435fa6adc55229_.log 2025-12-04T09:39:30.6768121Z Running 0 items in this shard: 2025-12-04T09:39:30.6768340Z 2025-12-04T09:39:30.6768617Z Finished test_sparse_csr 3/3 ... [2025-12-04 09:39:30.676495][2814.340308057], took 0.20min 2025-12-04T09:39:30.6859834Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_meta/test_meta-e4256c1c7d578c86.xml 2025-12-04T09:39:30.7603731Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_meta/test_meta-8c2b92910b7a7bdc.xml 2025-12-04T09:39:30.7866467Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_sparse_csr/test_sparse_csr-db92a6b1a0084256.xml 2025-12-04T09:39:33.7116699Z 2025-12-04T09:39:33.7118056Z test_meta 1/4 was successful, full logs can be found in artifacts with path test/test-reports/test_meta_1.4_8e318a314d876422_.log 2025-12-04T09:39:33.7119082Z Running 0 items in this shard: 2025-12-04T09:39:33.7119279Z 2025-12-04T09:39:33.7119528Z Finished test_meta 1/4 ... [2025-12-04 09:39:33.711564][2817.375378689], took 1.05min 2025-12-04T09:39:34.6115703Z Running functorch/test_ops 4/9 ... [2025-12-04 09:39:34.611061][2818.274876237] 2025-12-04T09:39:34.6116351Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:39:34.6118305Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'functorch/test_ops.py', '-m', 'not serial', '--shard-id=4', '--num-shards=9', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:39:34.611478] 2025-12-04T09:39:35.2545297Z 2025-12-04T09:39:35.2546283Z test_meta 2/4 was successful, full logs can be found in artifacts with path test/test-reports/test_meta_2.4_3fceec0bce921fc9_.log 2025-12-04T09:39:35.2547085Z Running 0 items in this shard: 2025-12-04T09:39:35.2547301Z 2025-12-04T09:39:35.2547545Z Finished test_meta 2/4 ... [2025-12-04 09:39:35.254387][2818.918203997], took 1.04min 2025-12-04T09:39:37.3938634Z Running functorch/test_ops 7/9 ... [2025-12-04 09:39:37.393383][2821.057199207] 2025-12-04T09:39:37.3939234Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:39:37.3941295Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'functorch/test_ops.py', '-m', 'not serial', '--shard-id=7', '--num-shards=9', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:39:37.393793] 2025-12-04T09:39:53.4091574Z 2025-12-04T09:39:53.4092730Z functorch/test_ops 4/9 was successful, full logs can be found in artifacts with path test/test-reports/functorch.test_ops_4.9_0564c349a6821973_.log 2025-12-04T09:39:53.4093556Z Running 0 items in this shard: 2025-12-04T09:39:53.4093740Z 2025-12-04T09:39:53.4094236Z Finished functorch/test_ops 4/9 ... [2025-12-04 09:39:53.409082][2837.072897496], took 0.31min 2025-12-04T09:39:53.4198029Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/functorch.test_ops/functorch.test_ops-77e738022a2c01ca.xml 2025-12-04T09:39:56.6341551Z 2025-12-04T09:39:56.6342693Z functorch/test_ops 7/9 was successful, full logs can be found in artifacts with path test/test-reports/functorch.test_ops_7.9_0314f84c3490751a_.log 2025-12-04T09:39:56.6344041Z Running 0 items in this shard: 2025-12-04T09:39:56.6344358Z 2025-12-04T09:39:56.6344783Z Finished functorch/test_ops 7/9 ... [2025-12-04 09:39:56.634015][2840.297834167], took 0.32min 2025-12-04T09:39:56.6450381Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/functorch.test_ops/functorch.test_ops-8fe434a401eb0404.xml 2025-12-04T09:39:57.6689738Z Running test batch 'tests to run' cost 1954.85 seconds 2025-12-04T09:39:57.6701271Z Emitting td_test_failure_stats_v2 2025-12-04T09:39:57.6704656Z Writing 1 documents to S3 ossci-raw-job-status/ossci_uploaded_metrics/td_test_failure_stats_v2_1764841197_3239d9a2d0f511f0a1cf0242ac110002 2025-12-04T09:39:57.7707857Z Done! Finish writing document to S3 ossci-raw-job-status/ossci_uploaded_metrics/td_test_failure_stats_v2_1764841197_3239d9a2d0f511f0a1cf0242ac110002 2025-12-04T09:39:57.7718154Z Emitting td_test_failure_stats_v2 2025-12-04T09:39:57.7720024Z Writing 1 documents to S3 ossci-raw-job-status/ossci_uploaded_metrics/td_test_failure_stats_v2_1764841197_324957e2d0f511f0a1cf0242ac110002 2025-12-04T09:39:57.8056361Z Done! Finish writing document to S3 ossci-raw-job-status/ossci_uploaded_metrics/td_test_failure_stats_v2_1764841197_324957e2d0f511f0a1cf0242ac110002 2025-12-04T09:39:57.8068961Z Emitting td_test_failure_stats_v2 2025-12-04T09:39:57.8070827Z Writing 1 documents to S3 ossci-raw-job-status/ossci_uploaded_metrics/td_test_failure_stats_v2_1764841197_324eb1ecd0f511f0a1cf0242ac110002 2025-12-04T09:39:57.8351947Z Done! Finish writing document to S3 ossci-raw-job-status/ossci_uploaded_metrics/td_test_failure_stats_v2_1764841197_324eb1ecd0f511f0a1cf0242ac110002 2025-12-04T09:39:57.8366387Z Emitting td_test_failure_stats_v2 2025-12-04T09:39:57.8368068Z Writing 1 documents to S3 ossci-raw-job-status/ossci_uploaded_metrics/td_test_failure_stats_v2_1764841197_32533b40d0f511f0a1cf0242ac110002 2025-12-04T09:39:57.8746822Z Done! Finish writing document to S3 ossci-raw-job-status/ossci_uploaded_metrics/td_test_failure_stats_v2_1764841197_32533b40d0f511f0a1cf0242ac110002 2025-12-04T09:39:57.8748302Z inductor/test_flex_attention 1/6 failed! 2025-12-04T09:39:57.8748750Z test_ci_sanity_check_fail 1/1 failed! 2025-12-04T09:39:57.8749045Z test_torch 1/1 failed! 2025-12-04T09:39:57.8749292Z test_fake_tensor 1/1 failed! 2025-12-04T09:39:58.8704781Z 2025-12-04T09:39:58.8705141Z real 32m40.502s 2025-12-04T09:39:58.8705362Z user 60m51.485s 2025-12-04T09:39:58.8705530Z sys 7m52.179s 2025-12-04T09:39:58.8705693Z + assert_git_not_dirty 2025-12-04T09:39:58.8705966Z + [[ linux-jammy-cuda12.8-py3.10-gcc11 != *rocm* ]] 2025-12-04T09:39:58.8706270Z + [[ linux-jammy-cuda12.8-py3.10-gcc11 != *xla* ]] 2025-12-04T09:39:58.8711344Z ++ git status --porcelain 2025-12-04T09:39:58.8712681Z ++ grep -v '?? third_party' 2025-12-04T09:40:02.0829020Z ++ true 2025-12-04T09:40:02.0829920Z + git_status= 2025-12-04T09:40:02.0830282Z + [[ -n '' ]] 2025-12-04T09:40:02.0830819Z + test_aten 2025-12-04T09:40:02.0831424Z + echo 'Running ATen tests with pytorch lib' 2025-12-04T09:40:02.0831770Z Running ATen tests with pytorch lib 2025-12-04T09:40:02.0832117Z + [[ -n '' ]] 2025-12-04T09:40:02.0832405Z + echo 'Running test with the build folder' 2025-12-04T09:40:02.0832710Z Running test with the build folder 2025-12-04T09:40:02.0833094Z + TEST_BASE_DIR=build/bin 2025-12-04T09:40:02.0834374Z + ln -sf /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10_cuda.so /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10d_cuda_test.so build/bin 2025-12-04T09:40:02.0867953Z + ln -sf /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libcaffe2_nvrtc.so build/bin 2025-12-04T09:40:02.0880926Z + ln -sf '/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libmkldnn*' build/bin 2025-12-04T09:40:02.0893164Z + ln -sf '/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libnccl*' build/bin 2025-12-04T09:40:02.0909152Z + ln -sf /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch.so /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda_linalg.so /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_global_deps.so /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_nvshmem.so /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_python.so /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorchbind_test.so build/bin 2025-12-04T09:40:02.0921930Z + ls build/bin 2025-12-04T09:40:02.0983278Z BackoffTest 2025-12-04T09:40:02.0983552Z CppSignature_test 2025-12-04T09:40:02.0983755Z Dict_test 2025-12-04T09:40:02.0984007Z Dimname_test 2025-12-04T09:40:02.0984215Z FileStoreTest 2025-12-04T09:40:02.0984499Z HashStoreTest 2025-12-04T09:40:02.0984849Z IListRef_test 2025-12-04T09:40:02.0985136Z KernelFunction_test 2025-12-04T09:40:02.0985490Z List_test 2025-12-04T09:40:02.0985795Z MaybeOwned_test 2025-12-04T09:40:02.0986137Z NamedTensor_test 2025-12-04T09:40:02.0986373Z ProcessGroupGlooAsyncTest 2025-12-04T09:40:02.0986730Z ProcessGroupGlooTest 2025-12-04T09:40:02.0986998Z ProcessGroupMPITest 2025-12-04T09:40:02.0987257Z ProcessGroupNCCLErrorsTest 2025-12-04T09:40:02.0987528Z ProcessGroupNCCLTest 2025-12-04T09:40:02.0987755Z StorageUtils_test 2025-12-04T09:40:02.0987986Z TCPStoreTest 2025-12-04T09:40:02.0988258Z apply_utils_test 2025-12-04T09:40:02.0988460Z atest 2025-12-04T09:40:02.0988656Z backend_fallback_test 2025-12-04T09:40:02.0988869Z basic 2025-12-04T09:40:02.0989062Z broadcast_test 2025-12-04T09:40:02.0989280Z c10_AllocatorConfig_test 2025-12-04T09:40:02.0989511Z c10_ArrayRef_test 2025-12-04T09:40:02.0989759Z c10_Bitset_test 2025-12-04T09:40:02.0990163Z c10_CompileTimeFunctionPointer_test 2025-12-04T09:40:02.0990661Z c10_ConstexprCrc_test 2025-12-04T09:40:02.0991071Z c10_DeadlockDetection_test 2025-12-04T09:40:02.0991486Z c10_DeviceGuard_test 2025-12-04T09:40:02.0991843Z c10_Device_test 2025-12-04T09:40:02.0992413Z c10_DispatchKeySet_test 2025-12-04T09:40:02.0992799Z c10_Enumerate_test 2025-12-04T09:40:02.0993007Z c10_Half_test 2025-12-04T09:40:02.0993222Z c10_InlineDeviceGuard_test 2025-12-04T09:40:02.0993597Z c10_InlineStreamGuard_test 2025-12-04T09:40:02.0993845Z c10_IntrusiveList_test 2025-12-04T09:40:02.0994072Z c10_LeftRight_test 2025-12-04T09:40:02.0994293Z c10_NetworkFlow_test 2025-12-04T09:40:02.0994512Z c10_Scalar_test 2025-12-04T09:40:02.0994717Z c10_Semaphore_test 2025-12-04T09:40:02.0994899Z c10_SizesAndStrides_test 2025-12-04T09:40:02.0995095Z c10_StreamGuard_test 2025-12-04T09:40:02.0995265Z c10_SymInt_test 2025-12-04T09:40:02.0995433Z c10_Synchronized_test 2025-12-04T09:40:02.0995605Z c10_ThreadLocal_test 2025-12-04T09:40:02.0995757Z c10_TypeIndex_test 2025-12-04T09:40:02.0995916Z c10_accumulate_test 2025-12-04T09:40:02.0996165Z c10_bfloat16_test 2025-12-04T09:40:02.0996315Z c10_bit_cast_test 2025-12-04T09:40:02.0996473Z c10_complex_math_test 2025-12-04T09:40:02.0996648Z c10_complex_test 2025-12-04T09:40:02.0996794Z c10_cow_test 2025-12-04T09:40:02.0996967Z c10_cuda_CUDAAssertionsTest_1_var_test 2025-12-04T09:40:02.0997209Z c10_cuda_CUDAAssertionsTest_catches_stream 2025-12-04T09:40:02.0997498Z c10_cuda_CUDAAssertionsTest_catches_thread_and_block_and_device 2025-12-04T09:40:02.0997806Z c10_cuda_CUDAAssertionsTest_from_2_processes 2025-12-04T09:40:02.0998187Z c10_cuda_CUDAAssertionsTest_multiple_writes_from_blocks_and_threads 2025-12-04T09:40:02.0998547Z c10_cuda_CUDAAssertionsTest_multiple_writes_from_multiple_blocks 2025-12-04T09:40:02.0998893Z c10_cuda_CUDAAssertionsTest_multiple_writes_from_same_block 2025-12-04T09:40:02.0999154Z c10_cuda_CUDATest 2025-12-04T09:40:02.0999309Z c10_error_test 2025-12-04T09:40:02.0999458Z c10_exception_test 2025-12-04T09:40:02.0999616Z c10_flags_test 2025-12-04T09:40:02.0999769Z c10_generic_math_test 2025-12-04T09:40:02.0999938Z c10_intrusive_ptr_benchmark 2025-12-04T09:40:02.1000129Z c10_intrusive_ptr_test 2025-12-04T09:40:02.1000295Z c10_irange_test 2025-12-04T09:40:02.1000442Z c10_lazy_test 2025-12-04T09:40:02.1000593Z c10_logging_test 2025-12-04T09:40:02.1000758Z c10_nofatal_test 2025-12-04T09:40:02.1000911Z c10_optional_test 2025-12-04T09:40:02.1001086Z c10_ordered_preserving_dict_test 2025-12-04T09:40:02.1001281Z c10_registry_test 2025-12-04T09:40:02.1001433Z c10_small_vector_test 2025-12-04T09:40:02.1001597Z c10_ssize_test 2025-12-04T09:40:02.1001750Z c10_string_util_test 2025-12-04T09:40:02.1001908Z c10_string_view_test 2025-12-04T09:40:02.1002071Z c10_tempfile_test 2025-12-04T09:40:02.1002229Z c10_typeid_test 2025-12-04T09:40:02.1002390Z cpu_allocator_test 2025-12-04T09:40:02.1002556Z cpu_generator_test 2025-12-04T09:40:02.1002725Z cpu_profiling_allocator_test 2025-12-04T09:40:02.1002909Z cpu_rng_test 2025-12-04T09:40:02.1003081Z cuda_allocatorTraceTracker_test 2025-12-04T09:40:02.1003280Z cuda_allocator_test 2025-12-04T09:40:02.1003445Z cuda_apply_test 2025-12-04T09:40:02.1003624Z cuda_atomic_ops_test 2025-12-04T09:40:02.1003797Z cuda_caching_host_allocator_test 2025-12-04T09:40:02.1003995Z cuda_complex_math_test 2025-12-04T09:40:02.1004165Z cuda_complex_test 2025-12-04T09:40:02.1004320Z cuda_cub_test 2025-12-04T09:40:02.1004482Z cuda_cublas_handle_pool_test 2025-12-04T09:40:02.1004666Z cuda_cudnn_test 2025-12-04T09:40:02.1004819Z cuda_device_test 2025-12-04T09:40:02.1004978Z cuda_distributions_test 2025-12-04T09:40:02.1005161Z cuda_dlconvertor_test 2025-12-04T09:40:02.1005324Z cuda_event_test 2025-12-04T09:40:02.1005488Z cuda_exchange_device_test 2025-12-04T09:40:02.1005672Z cuda_generator_test 2025-12-04T09:40:02.1005828Z cuda_half_test 2025-12-04T09:40:02.1005977Z cuda_integer_divider_test 2025-12-04T09:40:02.1006152Z cuda_optional_test 2025-12-04T09:40:02.1006321Z cuda_packedtensoraccessor_test 2025-12-04T09:40:02.1006522Z cuda_reportMemoryUsage_test 2025-12-04T09:40:02.1006705Z cuda_stream_test 2025-12-04T09:40:02.1006860Z cuda_vectorized_test 2025-12-04T09:40:02.1007067Z dlconvertor_test 2025-12-04T09:40:02.1007247Z example_allreduce 2025-12-04T09:40:02.1007422Z extension_backend_test 2025-12-04T09:40:02.1007585Z half_test 2025-12-04T09:40:02.1007780Z inline_container_test 2025-12-04T09:40:02.1007940Z ivalue_test 2025-12-04T09:40:02.1008107Z kernel_function_legacy_test 2025-12-04T09:40:02.1008289Z kernel_function_test 2025-12-04T09:40:02.1008458Z kernel_lambda_legacy_test 2025-12-04T09:40:02.1008635Z kernel_lambda_test 2025-12-04T09:40:02.1008790Z kernel_stackbased_test 2025-12-04T09:40:02.1008957Z lazy_tensor_test 2025-12-04T09:40:02.1009110Z legacy_vmap_test 2025-12-04T09:40:02.1009255Z libc10.so 2025-12-04T09:40:02.1009397Z libc10_cuda.so 2025-12-04T09:40:02.1009547Z libc10d_cuda_test.so 2025-12-04T09:40:02.1009706Z libcaffe2_nvrtc.so 2025-12-04T09:40:02.1009856Z 'libmkldnn*' 2025-12-04T09:40:02.1010054Z 'libnccl*' 2025-12-04T09:40:02.1010192Z libtorch.so 2025-12-04T09:40:02.1010334Z libtorch_cpu.so 2025-12-04T09:40:02.1010488Z libtorch_cuda.so 2025-12-04T09:40:02.1010648Z libtorch_cuda_linalg.so 2025-12-04T09:40:02.1010821Z libtorch_global_deps.so 2025-12-04T09:40:02.1010996Z libtorch_nvshmem.so 2025-12-04T09:40:02.1011160Z libtorch_python.so 2025-12-04T09:40:02.1011314Z libtorchbind_test.so 2025-12-04T09:40:02.1011496Z make_boxed_from_unboxed_functor_test 2025-12-04T09:40:02.1011700Z math_kernel_test 2025-12-04T09:40:02.1011901Z memory_format_test 2025-12-04T09:40:02.1012073Z memory_overlapping_test 2025-12-04T09:40:02.1012384Z mobile_memory_cleanup 2025-12-04T09:40:02.1012550Z native_test 2025-12-04T09:40:02.1012698Z op_allowlist_test 2025-12-04T09:40:02.1012868Z op_registration_test 2025-12-04T09:40:02.1013030Z operator_name_test 2025-12-04T09:40:02.1013188Z operators_test 2025-12-04T09:40:02.1013357Z packedtensoraccessor_test 2025-12-04T09:40:02.1013544Z parallel_benchmark 2025-12-04T09:40:02.1013703Z pow_test 2025-12-04T09:40:02.1013841Z protoc 2025-12-04T09:40:02.1013976Z protoc-3.13.0.0 2025-12-04T09:40:02.1014135Z quantized_test 2025-12-04T09:40:02.1014294Z reduce_ops_test 2025-12-04T09:40:02.1014452Z reportMemoryUsage_test 2025-12-04T09:40:02.1014622Z scalar_tensor_test 2025-12-04T09:40:02.1014771Z scalar_test 2025-12-04T09:40:02.1014928Z stride_properties_test 2025-12-04T09:40:02.1015103Z tensor_iterator_test 2025-12-04T09:40:02.1015264Z test_aoti_abi_check 2025-12-04T09:40:02.1015418Z test_api 2025-12-04T09:40:02.1015556Z test_cpp_rpc 2025-12-04T09:40:02.1015702Z test_dist_autograd 2025-12-04T09:40:02.1015853Z test_jit 2025-12-04T09:40:02.1015980Z test_lazy 2025-12-04T09:40:02.1016119Z test_parallel 2025-12-04T09:40:02.1016268Z test_vec_half_AVX2 2025-12-04T09:40:02.1016423Z test_vec_half_AVX512 2025-12-04T09:40:02.1016585Z test_vec_half_DEFAULT 2025-12-04T09:40:02.1016746Z thread_init_test 2025-12-04T09:40:02.1016892Z torch_shm_manager 2025-12-04T09:40:02.1017058Z type_ptr_test 2025-12-04T09:40:02.1017210Z type_test 2025-12-04T09:40:02.1017353Z undefined_tensor_test 2025-12-04T09:40:02.1017528Z vec_test_all_types_AVX2 2025-12-04T09:40:02.1017707Z vec_test_all_types_AVX512 2025-12-04T09:40:02.1017885Z vec_test_all_types_DEFAULT 2025-12-04T09:40:02.1018072Z verify_api_visibility 2025-12-04T09:40:02.1018238Z weakref_test 2025-12-04T09:40:02.1018379Z wrapdim_test 2025-12-04T09:40:02.1018538Z xla_tensor_test 2025-12-04T09:40:02.1018707Z + aten/tools/run_tests.sh build/bin 2025-12-04T09:40:02.1024326Z + set -e 2025-12-04T09:40:02.1027219Z ++ dirname aten/tools/run_tests.sh 2025-12-04T09:40:02.1045333Z + VALGRIND_SUP=/var/lib/jenkins/workspace/aten/tools/valgrind.sup 2025-12-04T09:40:02.1045944Z + export CPP_TESTS_DIR=build/bin 2025-12-04T09:40:02.1046219Z + CPP_TESTS_DIR=build/bin 2025-12-04T09:40:02.1046466Z + VALGRIND=ON 2025-12-04T09:40:02.1048058Z + python test/run_test.py --cpp --verbose -i cpp/basic cpp/atest cpp/scalar_test cpp/broadcast_test cpp/wrapdim_test cpp/apply_utils_test cpp/dlconvertor_test cpp/native_test cpp/scalar_tensor_test cpp/undefined_tensor_test cpp/extension_backend_test cpp/lazy_tensor_test cpp/tensor_iterator_test cpp/Dimname_test cpp/Dict_test cpp/NamedTensor_test cpp/cpu_generator_test cpp/legacy_vmap_test cpp/operators_test 2025-12-04T09:40:06.3965827Z Downloading https://ossci-metrics.s3.amazonaws.com/disabled-tests-condensed.json to /var/lib/jenkins/workspace/test/.pytorch-disabled-tests.json 2025-12-04T09:40:06.4045141Z Found test times from artifacts 2025-12-04T09:40:06.4363560Z Found test times from artifacts 2025-12-04T09:40:06.4373388Z Running all tests 2025-12-04T09:40:06.4377467Z Running parallel tests on 3 processes 2025-12-04T09:40:06.4379413Z Name: tests to run (est. time: 0.0min) 2025-12-04T09:40:06.4379749Z Serial tests (0): 2025-12-04T09:40:06.4379993Z Parallel tests (19): 2025-12-04T09:40:06.4380241Z cpp/Dict_test 1/1 2025-12-04T09:40:06.4380483Z cpp/Dimname_test 1/1 2025-12-04T09:40:06.4380934Z cpp/NamedTensor_test 1/1 2025-12-04T09:40:06.4381203Z cpp/apply_utils_test 1/1 2025-12-04T09:40:06.4381451Z cpp/atest 1/1 2025-12-04T09:40:06.4381690Z cpp/basic 1/1 2025-12-04T09:40:06.4381914Z cpp/broadcast_test 1/1 2025-12-04T09:40:06.4382172Z cpp/cpu_generator_test 1/1 2025-12-04T09:40:06.4382442Z cpp/dlconvertor_test 1/1 2025-12-04T09:40:06.4382719Z cpp/extension_backend_test 1/1 2025-12-04T09:40:06.4382998Z cpp/lazy_tensor_test 1/1 2025-12-04T09:40:06.4383271Z cpp/legacy_vmap_test 1/1 2025-12-04T09:40:06.4383652Z cpp/native_test 1/1 2025-12-04T09:40:06.4383912Z cpp/operators_test 1/1 2025-12-04T09:40:06.4384225Z cpp/scalar_tensor_test 1/1 2025-12-04T09:40:06.4384662Z cpp/scalar_test 1/1 2025-12-04T09:40:06.4384939Z cpp/tensor_iterator_test 1/1 2025-12-04T09:40:06.4385230Z cpp/undefined_tensor_test 1/1 2025-12-04T09:40:06.4385553Z cpp/wrapdim_test 1/1 2025-12-04T09:40:06.4385871Z Name: excluded (est. time: 0.0min) 2025-12-04T09:40:06.4386099Z Serial tests (0): 2025-12-04T09:40:06.4386287Z Parallel tests (0): 2025-12-04T09:40:06.4386590Z Running cpp/Dict_test 1/1 ... [2025-12-04 09:40:06.438298][2850.102118671] 2025-12-04T09:40:06.4386925Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:40:06.4387237Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T09:40:06.4387673Z Finished cpp/Dict_test 1/1 ... [2025-12-04 09:40:06.438461][2850.102282413], took 0.00min 2025-12-04T09:40:07.8271496Z Uploading artifacts took 1.38 seconds 2025-12-04T09:40:07.8272243Z Running cpp/Dimname_test 1/1 ... [2025-12-04 09:40:07.826908][2851.490723772] 2025-12-04T09:40:07.8272643Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:40:07.8273007Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T09:40:07.8273530Z Finished cpp/Dimname_test 1/1 ... [2025-12-04 09:40:07.827075][2851.490895465], took 0.00min 2025-12-04T09:40:07.8375420Z Running cpp/NamedTensor_test 1/1 ... [2025-12-04 09:40:07.837320][2851.501140675] 2025-12-04T09:40:07.8376011Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:40:07.8376381Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T09:40:07.8376906Z Finished cpp/NamedTensor_test 1/1 ... [2025-12-04 09:40:07.837458][2851.501278707], took 0.00min 2025-12-04T09:40:07.8473684Z Running cpp/apply_utils_test 1/1 ... [2025-12-04 09:40:07.847140][2851.510960749] 2025-12-04T09:40:07.8474238Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:40:07.8474600Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T09:40:07.8475103Z Finished cpp/apply_utils_test 1/1 ... [2025-12-04 09:40:07.847273][2851.511094031], took 0.00min 2025-12-04T09:40:07.8570996Z Running cpp/atest 1/1 ... [2025-12-04 09:40:07.856881][2851.52070248] 2025-12-04T09:40:07.8571569Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:40:07.8571924Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T09:40:07.8572484Z Finished cpp/atest 1/1 ... [2025-12-04 09:40:07.857015][2851.520835552], took 0.00min 2025-12-04T09:40:07.8667921Z Running cpp/basic 1/1 ... [2025-12-04 09:40:07.866570][2851.530391181] 2025-12-04T09:40:07.8668758Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:40:07.8669144Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T09:40:07.8669605Z Finished cpp/basic 1/1 ... [2025-12-04 09:40:07.866703][2851.530523993], took 0.00min 2025-12-04T09:40:07.8763791Z Running cpp/broadcast_test 1/1 ... [2025-12-04 09:40:07.876157][2851.539978659] 2025-12-04T09:40:07.8764235Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:40:07.8765242Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T09:40:07.8765758Z Finished cpp/broadcast_test 1/1 ... [2025-12-04 09:40:07.876377][2851.540197543], took 0.00min 2025-12-04T09:40:07.8861359Z Running cpp/cpu_generator_test 1/1 ... [2025-12-04 09:40:07.885945][2851.549766733] 2025-12-04T09:40:07.8862149Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:40:07.8862522Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T09:40:07.8863207Z Finished cpp/cpu_generator_test 1/1 ... [2025-12-04 09:40:07.886067][2851.549889085], took 0.00min 2025-12-04T09:40:07.8957843Z Running cpp/dlconvertor_test 1/1 ... [2025-12-04 09:40:07.895552][2851.559373581] 2025-12-04T09:40:07.8958464Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:40:07.8958830Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T09:40:07.8959339Z Finished cpp/dlconvertor_test 1/1 ... [2025-12-04 09:40:07.895667][2851.559489043], took 0.00min 2025-12-04T09:40:07.9053008Z Running cpp/extension_backend_test 1/1 ... [2025-12-04 09:40:07.905072][2851.56889313] 2025-12-04T09:40:07.9053568Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:40:07.9053929Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T09:40:07.9054445Z Finished cpp/extension_backend_test 1/1 ... [2025-12-04 09:40:07.905188][2851.569009792], took 0.00min 2025-12-04T09:40:07.9147977Z Running cpp/lazy_tensor_test 1/1 ... [2025-12-04 09:40:07.914584][2851.578405007] 2025-12-04T09:40:07.9148476Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:40:07.9148838Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T09:40:07.9149359Z Finished cpp/lazy_tensor_test 1/1 ... [2025-12-04 09:40:07.914700][2851.578521879], took 0.00min 2025-12-04T09:40:07.9242774Z Running cpp/legacy_vmap_test 1/1 ... [2025-12-04 09:40:07.924049][2851.587870454] 2025-12-04T09:40:07.9243207Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:40:07.9243559Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T09:40:07.9244064Z Finished cpp/legacy_vmap_test 1/1 ... [2025-12-04 09:40:07.924169][2851.587990316], took 0.00min 2025-12-04T09:40:07.9337072Z Running cpp/native_test 1/1 ... [2025-12-04 09:40:07.933502][2851.59732336] 2025-12-04T09:40:07.9337640Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:40:07.9337973Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T09:40:07.9338461Z Finished cpp/native_test 1/1 ... [2025-12-04 09:40:07.933619][2851.597441292], took 0.00min 2025-12-04T09:40:07.9431678Z Running cpp/operators_test 1/1 ... [2025-12-04 09:40:07.942966][2851.606787657] 2025-12-04T09:40:07.9432243Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:40:07.9432600Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T09:40:07.9433097Z Finished cpp/operators_test 1/1 ... [2025-12-04 09:40:07.943084][2851.606905659], took 0.00min 2025-12-04T09:40:07.9526209Z Running cpp/scalar_tensor_test 1/1 ... [2025-12-04 09:40:07.952397][2851.616218494] 2025-12-04T09:40:07.9526683Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:40:07.9527052Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T09:40:07.9527584Z Finished cpp/scalar_tensor_test 1/1 ... [2025-12-04 09:40:07.952512][2851.616333825], took 0.00min 2025-12-04T09:40:07.9621256Z Running cpp/scalar_test 1/1 ... [2025-12-04 09:40:07.961888][2851.625709431] 2025-12-04T09:40:07.9621835Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:40:07.9622266Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T09:40:07.9622997Z Finished cpp/scalar_test 1/1 ... [2025-12-04 09:40:07.962007][2851.625828833], took 0.00min 2025-12-04T09:40:07.9715683Z Running cpp/tensor_iterator_test 1/1 ... [2025-12-04 09:40:07.971330][2851.635150917] 2025-12-04T09:40:07.9716373Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:40:07.9716788Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T09:40:07.9717424Z Finished cpp/tensor_iterator_test 1/1 ... [2025-12-04 09:40:07.971449][2851.635270849], took 0.00min 2025-12-04T09:40:07.9809577Z Running cpp/undefined_tensor_test 1/1 ... [2025-12-04 09:40:07.980755][2851.644576363] 2025-12-04T09:40:07.9810150Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:40:07.9810530Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T09:40:07.9811102Z Finished cpp/undefined_tensor_test 1/1 ... [2025-12-04 09:40:07.980873][2851.644694956], took 0.00min 2025-12-04T09:40:07.9903933Z Running cpp/wrapdim_test 1/1 ... [2025-12-04 09:40:07.990190][2851.654010961] 2025-12-04T09:40:07.9904550Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:40:07.9904955Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T09:40:07.9905511Z Finished cpp/wrapdim_test 1/1 ... [2025-12-04 09:40:07.990309][2851.654130952], took 0.00min 2025-12-04T09:40:10.9536570Z Running cpp/Dict_test 1/1 ... [2025-12-04 09:40:10.953083][2854.616900081] 2025-12-04T09:40:10.9538563Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:40:10.9539215Z Running cpp/Dimname_test 1/1 ... [2025-12-04 09:40:10.953195][2854.617011622] 2025-12-04T09:40:10.9539768Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:40:10.9540271Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T09:40:10.9540978Z Finished cpp/Dict_test 1/1 ... [2025-12-04 09:40:10.953306][2854.617125464], took 0.00min 2025-12-04T09:40:10.9541696Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T09:40:10.9542469Z Finished cpp/Dimname_test 1/1 ... [2025-12-04 09:40:10.953371][2854.617192076], took 0.00min 2025-12-04T09:40:10.9549634Z Running cpp/NamedTensor_test 1/1 ... [2025-12-04 09:40:10.954659][2854.618474568] 2025-12-04T09:40:10.9550265Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:40:10.9550793Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T09:40:10.9551215Z Finished cpp/NamedTensor_test 1/1 ... [2025-12-04 09:40:10.954882][2854.618701742], took 0.00min 2025-12-04T09:40:14.7223142Z Running cpp/apply_utils_test 1/1 ... [2025-12-04 09:40:14.721828][2858.385644498] 2025-12-04T09:40:14.7223850Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:40:14.7224321Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T09:40:14.7224880Z Finished cpp/apply_utils_test 1/1 ... [2025-12-04 09:40:14.722005][2858.385825701], took 0.00min 2025-12-04T09:40:14.7643139Z Running cpp/atest 1/1 ... [2025-12-04 09:40:14.763833][2858.427648138] 2025-12-04T09:40:14.7643898Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:40:14.7644362Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T09:40:14.7645203Z Finished cpp/atest 1/1 ... [2025-12-04 09:40:14.764009][2858.427829481], took 0.00min 2025-12-04T09:40:14.7654635Z Running cpp/basic 1/1 ... [2025-12-04 09:40:14.765203][2858.429019462] 2025-12-04T09:40:14.7655173Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:40:14.7655963Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T09:40:14.7656572Z Finished cpp/basic 1/1 ... [2025-12-04 09:40:14.765403][2858.429223686], took 0.00min 2025-12-04T09:40:18.5085077Z Running cpp/broadcast_test 1/1 ... [2025-12-04 09:40:18.508033][2862.171849732] 2025-12-04T09:40:18.5085756Z Running cpp/cpu_generator_test 1/1 ... [2025-12-04 09:40:18.508058][2862.171874653] 2025-12-04T09:40:18.5086224Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:40:18.5086529Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:40:18.5086945Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T09:40:18.5087749Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T09:40:18.5088360Z Finished cpp/broadcast_test 1/1 ... [2025-12-04 09:40:18.508207][2862.172027305], took 0.00min 2025-12-04T09:40:18.5089042Z Finished cpp/cpu_generator_test 1/1 ... [2025-12-04 09:40:18.508210][2862.172032125], took 0.00min 2025-12-04T09:40:18.5207326Z Running cpp/dlconvertor_test 1/1 ... [2025-12-04 09:40:18.520392][2862.1842085] 2025-12-04T09:40:18.5207845Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:40:18.5208215Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T09:40:18.5208737Z Finished cpp/dlconvertor_test 1/1 ... [2025-12-04 09:40:18.520566][2862.184385803], took 0.00min 2025-12-04T09:40:22.2586443Z Running cpp/extension_backend_test 1/1 ... [2025-12-04 09:40:22.258122][2865.92193845] 2025-12-04T09:40:22.2587193Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:40:22.2588202Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T09:40:22.2588909Z Finished cpp/extension_backend_test 1/1 ... [2025-12-04 09:40:22.258317][2865.922136613], took 0.00min 2025-12-04T09:40:22.2725278Z Running cpp/lazy_tensor_test 1/1 ... [2025-12-04 09:40:22.272223][2865.936039909] 2025-12-04T09:40:22.2725753Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:40:22.2726337Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T09:40:22.2727111Z Finished cpp/lazy_tensor_test 1/1 ... [2025-12-04 09:40:22.272415][2865.936234312], took 0.00min 2025-12-04T09:40:22.2913351Z Running cpp/legacy_vmap_test 1/1 ... [2025-12-04 09:40:22.291015][2865.95483069] 2025-12-04T09:40:22.2913931Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:40:22.2914303Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T09:40:22.2914839Z Finished cpp/legacy_vmap_test 1/1 ... [2025-12-04 09:40:22.291188][2865.955008063], took 0.00min 2025-12-04T09:40:25.9389254Z Running cpp/native_test 1/1 ... [2025-12-04 09:40:25.938421][2869.602237568] 2025-12-04T09:40:25.9389739Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:40:25.9565658Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T09:40:25.9566375Z Finished cpp/native_test 1/1 ... [2025-12-04 09:40:25.938633][2869.602452811], took 0.00min 2025-12-04T09:40:25.9566920Z Running cpp/operators_test 1/1 ... [2025-12-04 09:40:25.956249][2869.620064012] 2025-12-04T09:40:25.9567315Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:40:25.9567683Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T09:40:25.9568178Z Finished cpp/operators_test 1/1 ... [2025-12-04 09:40:25.956412][2869.620232375], took 0.00min 2025-12-04T09:40:26.0612767Z Running cpp/scalar_tensor_test 1/1 ... [2025-12-04 09:40:26.060816][2869.724631865] 2025-12-04T09:40:26.0613404Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:40:26.0613824Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T09:40:26.0614479Z Finished cpp/scalar_tensor_test 1/1 ... [2025-12-04 09:40:26.061017][2869.724835559], took 0.00min 2025-12-04T09:40:29.7986850Z Running cpp/scalar_test 1/1 ... [2025-12-04 09:40:29.798169][2873.461984519] 2025-12-04T09:40:29.7987492Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:40:29.7987901Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T09:40:29.7988596Z Finished cpp/scalar_test 1/1 ... [2025-12-04 09:40:29.798336][2873.462157212], took 0.00min 2025-12-04T09:40:29.8066813Z Running cpp/tensor_iterator_test 1/1 ... [2025-12-04 09:40:29.806337][2873.470153853] 2025-12-04T09:40:29.8067472Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:40:29.8068111Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T09:40:29.8068772Z Finished cpp/tensor_iterator_test 1/1 ... [2025-12-04 09:40:29.806500][2873.470319056], took 0.00min 2025-12-04T09:40:29.8116945Z Running cpp/undefined_tensor_test 1/1 ... [2025-12-04 09:40:29.811424][2873.475240093] 2025-12-04T09:40:29.8117504Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:40:29.8118049Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T09:40:29.8118900Z Finished cpp/undefined_tensor_test 1/1 ... [2025-12-04 09:40:29.811588][2873.475408416], took 0.00min 2025-12-04T09:40:33.6274721Z Running cpp/wrapdim_test 1/1 ... [2025-12-04 09:40:33.627005][2877.290821325] 2025-12-04T09:40:33.6275550Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:40:33.6275951Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T09:40:33.6276542Z Finished cpp/wrapdim_test 1/1 ... [2025-12-04 09:40:33.627228][2877.291047019], took 0.00min 2025-12-04T09:40:34.5140366Z Running test batch 'tests to run' cost 28.08 seconds 2025-12-04T09:40:35.1092190Z + run_if_exists tensor_interop_test 2025-12-04T09:40:35.1092718Z + local test_name=tensor_interop_test 2025-12-04T09:40:35.1093040Z + [[ -x build/bin/tensor_interop_test ]] 2025-12-04T09:40:35.1093389Z + echo 'Warning: tensor_interop_test does not exist.' 2025-12-04T09:40:35.1094014Z Warning: tensor_interop_test does not exist. 2025-12-04T09:40:35.1094273Z + run_if_exists cudnn_test 2025-12-04T09:40:35.1094480Z + local test_name=cudnn_test 2025-12-04T09:40:35.1094673Z + [[ -x build/bin/cudnn_test ]] 2025-12-04T09:40:35.1094889Z + echo 'Warning: cudnn_test does not exist.' 2025-12-04T09:40:35.1095126Z Warning: cudnn_test does not exist. 2025-12-04T09:40:35.1095338Z + run_if_exists cuda_generator_test 2025-12-04T09:40:35.1095564Z + local test_name=cuda_generator_test 2025-12-04T09:40:35.1095892Z + [[ -x build/bin/cuda_generator_test ]] 2025-12-04T09:40:35.1096198Z + python test/run_test.py --cpp --verbose -i cpp/cuda_generator_test 2025-12-04T09:40:39.4428758Z Downloading https://ossci-metrics.s3.amazonaws.com/disabled-tests-condensed.json to /var/lib/jenkins/workspace/test/.pytorch-disabled-tests.json 2025-12-04T09:40:39.4507478Z Found test times from artifacts 2025-12-04T09:40:39.4825184Z Found test times from artifacts 2025-12-04T09:40:39.4834816Z Running all tests 2025-12-04T09:40:39.4837731Z Running parallel tests on 3 processes 2025-12-04T09:40:39.4838066Z Name: tests to run (est. time: 0.0min) 2025-12-04T09:40:39.4838364Z Serial tests (0): 2025-12-04T09:40:39.4838582Z Parallel tests (1): 2025-12-04T09:40:39.4838825Z cpp/cuda_generator_test 1/1 2025-12-04T09:40:39.4839105Z Name: excluded (est. time: 0.0min) 2025-12-04T09:40:39.4839368Z Serial tests (0): 2025-12-04T09:40:39.4839582Z Parallel tests (0): 2025-12-04T09:40:39.4840384Z Running cpp/cuda_generator_test 1/1 ... [2025-12-04 09:40:39.483785][2883.147606479] 2025-12-04T09:40:39.4840883Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:40:39.4841271Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T09:40:39.4841828Z Finished cpp/cuda_generator_test 1/1 ... [2025-12-04 09:40:39.483913][2883.147735912], took 0.00min 2025-12-04T09:40:40.7122094Z Uploading artifacts took 1.22 seconds 2025-12-04T09:40:43.7253863Z Running cpp/cuda_generator_test 1/1 ... [2025-12-04 09:40:43.724857][2887.388672866] 2025-12-04T09:40:43.7254689Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:40:43.7255457Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T09:40:43.7256230Z Finished cpp/cuda_generator_test 1/1 ... [2025-12-04 09:40:43.725060][2887.3888797], took 0.00min 2025-12-04T09:40:44.6328358Z Running test batch 'tests to run' cost 5.15 seconds 2025-12-04T09:40:45.2505169Z + run_if_exists apply_test 2025-12-04T09:40:45.2505499Z + local test_name=apply_test 2025-12-04T09:40:45.2505763Z + [[ -x build/bin/apply_test ]] 2025-12-04T09:40:45.2506070Z + echo 'Warning: apply_test does not exist.' 2025-12-04T09:40:45.2506371Z Warning: apply_test does not exist. 2025-12-04T09:40:45.2506642Z + run_if_exists stream_test 2025-12-04T09:40:45.2506879Z + local test_name=stream_test 2025-12-04T09:40:45.2507124Z + [[ -x build/bin/stream_test ]] 2025-12-04T09:40:45.2507384Z + echo 'Warning: stream_test does not exist.' 2025-12-04T09:40:45.2507674Z Warning: stream_test does not exist. 2025-12-04T09:40:45.2507943Z + run_if_exists cuda_half_test 2025-12-04T09:40:45.2508185Z + local test_name=cuda_half_test 2025-12-04T09:40:45.2508706Z + [[ -x build/bin/cuda_half_test ]] 2025-12-04T09:40:45.2509072Z + python test/run_test.py --cpp --verbose -i cpp/cuda_half_test 2025-12-04T09:40:49.5932706Z Downloading https://ossci-metrics.s3.amazonaws.com/disabled-tests-condensed.json to /var/lib/jenkins/workspace/test/.pytorch-disabled-tests.json 2025-12-04T09:40:49.6011458Z Found test times from artifacts 2025-12-04T09:40:49.6332446Z Found test times from artifacts 2025-12-04T09:40:49.6341841Z Running all tests 2025-12-04T09:40:49.6344720Z Running parallel tests on 3 processes 2025-12-04T09:40:49.6345143Z Name: tests to run (est. time: 0.0min) 2025-12-04T09:40:49.6345487Z Serial tests (0): 2025-12-04T09:40:49.6345752Z Parallel tests (1): 2025-12-04T09:40:49.6346026Z cpp/cuda_half_test 1/1 2025-12-04T09:40:49.6346317Z Name: excluded (est. time: 0.0min) 2025-12-04T09:40:49.6346778Z Serial tests (0): 2025-12-04T09:40:49.6347253Z Parallel tests (0): 2025-12-04T09:40:49.6347752Z Running cpp/cuda_half_test 1/1 ... [2025-12-04 09:40:49.634467][2893.298288399] 2025-12-04T09:40:49.6348238Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:40:49.6348661Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T09:40:49.6349254Z Finished cpp/cuda_half_test 1/1 ... [2025-12-04 09:40:49.634597][2893.298419421], took 0.00min 2025-12-04T09:40:50.9541672Z Uploading artifacts took 1.31 seconds 2025-12-04T09:40:53.9332082Z Running cpp/cuda_half_test 1/1 ... [2025-12-04 09:40:53.932619][2897.596434962] 2025-12-04T09:40:53.9333001Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:40:53.9333523Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T09:40:53.9334163Z Finished cpp/cuda_half_test 1/1 ... [2025-12-04 09:40:53.932790][2897.596611215], took 0.00min 2025-12-04T09:40:54.8333517Z Running test batch 'tests to run' cost 5.2 seconds 2025-12-04T09:40:55.4547872Z + run_if_exists cuda_vectorized_test 2025-12-04T09:40:55.4548465Z + local test_name=cuda_vectorized_test 2025-12-04T09:40:55.4548811Z + [[ -x build/bin/cuda_vectorized_test ]] 2025-12-04T09:40:55.4549239Z + python test/run_test.py --cpp --verbose -i cpp/cuda_vectorized_test 2025-12-04T09:40:59.7906881Z Downloading https://ossci-metrics.s3.amazonaws.com/disabled-tests-condensed.json to /var/lib/jenkins/workspace/test/.pytorch-disabled-tests.json 2025-12-04T09:40:59.7985587Z Found test times from artifacts 2025-12-04T09:40:59.8305606Z Found test times from artifacts 2025-12-04T09:40:59.8315835Z Running all tests 2025-12-04T09:40:59.8319081Z Running parallel tests on 3 processes 2025-12-04T09:40:59.8319555Z Name: tests to run (est. time: 0.0min) 2025-12-04T09:40:59.8319860Z Serial tests (0): 2025-12-04T09:40:59.8320103Z Parallel tests (1): 2025-12-04T09:40:59.8320366Z cpp/cuda_vectorized_test 1/1 2025-12-04T09:40:59.8320745Z Name: excluded (est. time: 0.0min) 2025-12-04T09:40:59.8320980Z Serial tests (0): 2025-12-04T09:40:59.8321181Z Parallel tests (0): 2025-12-04T09:40:59.8321717Z Running cpp/cuda_vectorized_test 1/1 ... [2025-12-04 09:40:59.831944][2903.495763903] 2025-12-04T09:40:59.8322238Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:40:59.8322577Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T09:40:59.8323067Z Finished cpp/cuda_vectorized_test 1/1 ... [2025-12-04 09:40:59.832082][2903.495904146], took 0.00min 2025-12-04T09:41:01.2064725Z Uploading artifacts took 1.36 seconds 2025-12-04T09:41:04.1626237Z Running cpp/cuda_vectorized_test 1/1 ... [2025-12-04 09:41:04.162096][2907.825912304] 2025-12-04T09:41:04.1626785Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:41:04.1627187Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T09:41:04.1627800Z Finished cpp/cuda_vectorized_test 1/1 ... [2025-12-04 09:41:04.162261][2907.826083117], took 0.00min 2025-12-04T09:41:05.0457012Z Running test batch 'tests to run' cost 5.21 seconds 2025-12-04T09:41:05.6551521Z + run_if_exists cuda_distributions_test 2025-12-04T09:41:05.6551914Z + local test_name=cuda_distributions_test 2025-12-04T09:41:05.6552573Z + [[ -x build/bin/cuda_distributions_test ]] 2025-12-04T09:41:05.6553041Z + python test/run_test.py --cpp --verbose -i cpp/cuda_distributions_test 2025-12-04T09:41:10.0289536Z Downloading https://ossci-metrics.s3.amazonaws.com/disabled-tests-condensed.json to /var/lib/jenkins/workspace/test/.pytorch-disabled-tests.json 2025-12-04T09:41:10.0368170Z Found test times from artifacts 2025-12-04T09:41:10.0686777Z Found test times from artifacts 2025-12-04T09:41:10.0696529Z Running all tests 2025-12-04T09:41:10.0699427Z Running parallel tests on 3 processes 2025-12-04T09:41:10.0699803Z Name: tests to run (est. time: 0.0min) 2025-12-04T09:41:10.0700131Z Serial tests (0): 2025-12-04T09:41:10.0700381Z Parallel tests (1): 2025-12-04T09:41:10.0700775Z cpp/cuda_distributions_test 1/1 2025-12-04T09:41:10.0701177Z Name: excluded (est. time: 0.0min) 2025-12-04T09:41:10.0701862Z Serial tests (0): 2025-12-04T09:41:10.0702115Z Parallel tests (0): 2025-12-04T09:41:10.0702712Z Running cpp/cuda_distributions_test 1/1 ... [2025-12-04 09:41:10.069941][2913.733761597] 2025-12-04T09:41:10.0703292Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:41:10.0703715Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T09:41:10.0704289Z Finished cpp/cuda_distributions_test 1/1 ... [2025-12-04 09:41:10.070071][2913.733892919], took 0.00min 2025-12-04T09:41:11.3815154Z Uploading artifacts took 1.30 seconds 2025-12-04T09:41:14.2916390Z Running cpp/cuda_distributions_test 1/1 ... [2025-12-04 09:41:14.291110][2917.954925729] 2025-12-04T09:41:14.2916984Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:41:14.2917362Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T09:41:14.2917952Z Finished cpp/cuda_distributions_test 1/1 ... [2025-12-04 09:41:14.291338][2917.955156613], took 0.00min 2025-12-04T09:41:15.2118637Z Running test batch 'tests to run' cost 5.14 seconds 2025-12-04T09:41:15.8279472Z + run_if_exists cuda_optional_test 2025-12-04T09:41:15.8279891Z + local test_name=cuda_optional_test 2025-12-04T09:41:15.8280297Z + [[ -x build/bin/cuda_optional_test ]] 2025-12-04T09:41:15.8280772Z + python test/run_test.py --cpp --verbose -i cpp/cuda_optional_test 2025-12-04T09:41:20.1766900Z Downloading https://ossci-metrics.s3.amazonaws.com/disabled-tests-condensed.json to /var/lib/jenkins/workspace/test/.pytorch-disabled-tests.json 2025-12-04T09:41:20.1844857Z Found test times from artifacts 2025-12-04T09:41:20.2162504Z Found test times from artifacts 2025-12-04T09:41:20.2172202Z Running all tests 2025-12-04T09:41:20.2175667Z Running parallel tests on 3 processes 2025-12-04T09:41:20.2176046Z Name: tests to run (est. time: 0.0min) 2025-12-04T09:41:20.2176417Z Serial tests (0): 2025-12-04T09:41:20.2176614Z Parallel tests (1): 2025-12-04T09:41:20.2176822Z cpp/cuda_optional_test 1/1 2025-12-04T09:41:20.2177068Z Name: excluded (est. time: 0.0min) 2025-12-04T09:41:20.2177403Z Serial tests (0): 2025-12-04T09:41:20.2177621Z Parallel tests (0): 2025-12-04T09:41:20.2177963Z Running cpp/cuda_optional_test 1/1 ... [2025-12-04 09:41:20.217571][2923.881391604] 2025-12-04T09:41:20.2178343Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:41:20.2178661Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T09:41:20.2179128Z Finished cpp/cuda_optional_test 1/1 ... [2025-12-04 09:41:20.217706][2923.881527816], took 0.00min 2025-12-04T09:41:21.5098271Z Uploading artifacts took 1.28 seconds 2025-12-04T09:41:24.5051498Z Running cpp/cuda_optional_test 1/1 ... [2025-12-04 09:41:24.504668][2928.168485068] 2025-12-04T09:41:24.5051998Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:41:24.5052502Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T09:41:24.5053058Z Finished cpp/cuda_optional_test 1/1 ... [2025-12-04 09:41:24.504851][2928.168671821], took 0.00min 2025-12-04T09:41:25.3947124Z Running test batch 'tests to run' cost 5.18 seconds 2025-12-04T09:41:26.0084485Z + run_if_exists cuda_tensor_interop_test 2025-12-04T09:41:26.0084924Z + local test_name=cuda_tensor_interop_test 2025-12-04T09:41:26.0085655Z + [[ -x build/bin/cuda_tensor_interop_test ]] 2025-12-04T09:41:26.0086090Z + echo 'Warning: cuda_tensor_interop_test does not exist.' 2025-12-04T09:41:26.0086522Z Warning: cuda_tensor_interop_test does not exist. 2025-12-04T09:41:26.0087007Z + run_if_exists cuda_complex_test 2025-12-04T09:41:26.0087336Z + local test_name=cuda_complex_test 2025-12-04T09:41:26.0087675Z + [[ -x build/bin/cuda_complex_test ]] 2025-12-04T09:41:26.0088036Z + python test/run_test.py --cpp --verbose -i cpp/cuda_complex_test 2025-12-04T09:41:30.3533140Z Downloading https://ossci-metrics.s3.amazonaws.com/disabled-tests-condensed.json to /var/lib/jenkins/workspace/test/.pytorch-disabled-tests.json 2025-12-04T09:41:30.3611425Z Found test times from artifacts 2025-12-04T09:41:30.3931387Z Found test times from artifacts 2025-12-04T09:41:30.3941099Z Running all tests 2025-12-04T09:41:30.3944765Z Running parallel tests on 3 processes 2025-12-04T09:41:30.3945259Z Name: tests to run (est. time: 0.0min) 2025-12-04T09:41:30.3945630Z Serial tests (0): 2025-12-04T09:41:30.3945881Z Parallel tests (1): 2025-12-04T09:41:30.3946174Z cpp/cuda_complex_test 1/1 2025-12-04T09:41:30.3946598Z Name: excluded (est. time: 0.0min) 2025-12-04T09:41:30.3946903Z Serial tests (0): 2025-12-04T09:41:30.3947196Z Parallel tests (0): 2025-12-04T09:41:30.3947705Z Running cpp/cuda_complex_test 1/1 ... [2025-12-04 09:41:30.394489][2934.058309387] 2025-12-04T09:41:30.3948485Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:41:30.3948928Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T09:41:30.3949434Z Finished cpp/cuda_complex_test 1/1 ... [2025-12-04 09:41:30.394621][2934.058443819], took 0.00min 2025-12-04T09:41:31.5365195Z Uploading artifacts took 1.13 seconds 2025-12-04T09:41:34.5616152Z Running cpp/cuda_complex_test 1/1 ... [2025-12-04 09:41:34.561093][2938.224909485] 2025-12-04T09:41:34.5616954Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:41:34.5617552Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T09:41:34.5618381Z Finished cpp/cuda_complex_test 1/1 ... [2025-12-04 09:41:34.561314][2938.225133959], took 0.00min 2025-12-04T09:41:35.4610771Z Running test batch 'tests to run' cost 5.07 seconds 2025-12-04T09:41:36.0877664Z + run_if_exists cuda_complex_math_test 2025-12-04T09:41:36.0878125Z + local test_name=cuda_complex_math_test 2025-12-04T09:41:36.0878585Z + [[ -x build/bin/cuda_complex_math_test ]] 2025-12-04T09:41:36.0879138Z + python test/run_test.py --cpp --verbose -i cpp/cuda_complex_math_test 2025-12-04T09:41:40.5057096Z Downloading https://ossci-metrics.s3.amazonaws.com/disabled-tests-condensed.json to /var/lib/jenkins/workspace/test/.pytorch-disabled-tests.json 2025-12-04T09:41:40.5135621Z Found test times from artifacts 2025-12-04T09:41:40.5453066Z Found test times from artifacts 2025-12-04T09:41:40.5463072Z Running all tests 2025-12-04T09:41:40.5466023Z Running parallel tests on 3 processes 2025-12-04T09:41:40.5466448Z Name: tests to run (est. time: 0.0min) 2025-12-04T09:41:40.5466884Z Serial tests (0): 2025-12-04T09:41:40.5467210Z Parallel tests (1): 2025-12-04T09:41:40.5467638Z cpp/cuda_complex_math_test 1/1 2025-12-04T09:41:40.5468073Z Name: excluded (est. time: 0.0min) 2025-12-04T09:41:40.5468386Z Serial tests (0): 2025-12-04T09:41:40.5468724Z Parallel tests (0): 2025-12-04T09:41:40.5469222Z Running cpp/cuda_complex_math_test 1/1 ... [2025-12-04 09:41:40.546625][2944.210445241] 2025-12-04T09:41:40.5469742Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:41:40.5470154Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T09:41:40.5470789Z Finished cpp/cuda_complex_math_test 1/1 ... [2025-12-04 09:41:40.546759][2944.210581244], took 0.00min 2025-12-04T09:41:41.9056538Z Uploading artifacts took 1.35 seconds 2025-12-04T09:41:44.8314165Z Running cpp/cuda_complex_math_test 1/1 ... [2025-12-04 09:41:44.830913][2948.494729611] 2025-12-04T09:41:44.8314748Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:41:44.8315488Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T09:41:44.8316114Z Finished cpp/cuda_complex_math_test 1/1 ... [2025-12-04 09:41:44.831131][2948.494950105], took 0.00min 2025-12-04T09:41:45.7402803Z Running test batch 'tests to run' cost 5.19 seconds 2025-12-04T09:41:46.3748607Z + run_if_exists cuda_cub_test 2025-12-04T09:41:46.3748966Z + local test_name=cuda_cub_test 2025-12-04T09:41:46.3756586Z + [[ -x build/bin/cuda_cub_test ]] 2025-12-04T09:41:46.3756963Z + python test/run_test.py --cpp --verbose -i cpp/cuda_cub_test 2025-12-04T09:41:50.7263648Z Downloading https://ossci-metrics.s3.amazonaws.com/disabled-tests-condensed.json to /var/lib/jenkins/workspace/test/.pytorch-disabled-tests.json 2025-12-04T09:41:50.7342470Z Found test times from artifacts 2025-12-04T09:41:50.7663757Z Found test times from artifacts 2025-12-04T09:41:50.7674790Z Running all tests 2025-12-04T09:41:50.7678126Z Running parallel tests on 3 processes 2025-12-04T09:41:50.7678516Z Name: tests to run (est. time: 0.0min) 2025-12-04T09:41:50.7678853Z Serial tests (0): 2025-12-04T09:41:50.7679102Z Parallel tests (1): 2025-12-04T09:41:50.7679375Z cpp/cuda_cub_test 1/1 2025-12-04T09:41:50.7679683Z Name: excluded (est. time: 0.0min) 2025-12-04T09:41:50.7679995Z Serial tests (0): 2025-12-04T09:41:50.7680236Z Parallel tests (0): 2025-12-04T09:41:50.7680638Z Running cpp/cuda_cub_test 1/1 ... [2025-12-04 09:41:50.767845][2954.431664967] 2025-12-04T09:41:50.7681279Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:41:50.7681942Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T09:41:50.7682619Z Finished cpp/cuda_cub_test 1/1 ... [2025-12-04 09:41:50.768000][2954.43182162], took 0.00min 2025-12-04T09:41:52.0239847Z Uploading artifacts took 1.25 seconds 2025-12-04T09:41:55.0546190Z Running cpp/cuda_cub_test 1/1 ... [2025-12-04 09:41:55.054101][2958.71791782] 2025-12-04T09:41:55.0546771Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:41:55.0547209Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T09:41:55.0547833Z Finished cpp/cuda_cub_test 1/1 ... [2025-12-04 09:41:55.054272][2958.718093094], took 0.00min 2025-12-04T09:41:55.9172011Z Running test batch 'tests to run' cost 5.15 seconds 2025-12-04T09:41:56.5265442Z + run_if_exists cuda_atomic_ops_test 2025-12-04T09:41:56.5265829Z + local test_name=cuda_atomic_ops_test 2025-12-04T09:41:56.5266163Z + [[ -x build/bin/cuda_atomic_ops_test ]] 2025-12-04T09:41:56.5266607Z + python test/run_test.py --cpp --verbose -i cpp/cuda_atomic_ops_test 2025-12-04T09:42:00.8747112Z Downloading https://ossci-metrics.s3.amazonaws.com/disabled-tests-condensed.json to /var/lib/jenkins/workspace/test/.pytorch-disabled-tests.json 2025-12-04T09:42:00.8825342Z Found test times from artifacts 2025-12-04T09:42:00.9143468Z Found test times from artifacts 2025-12-04T09:42:00.9153135Z Running all tests 2025-12-04T09:42:00.9156215Z Running parallel tests on 3 processes 2025-12-04T09:42:00.9156588Z Name: tests to run (est. time: 0.0min) 2025-12-04T09:42:00.9157033Z Serial tests (0): 2025-12-04T09:42:00.9157297Z Parallel tests (1): 2025-12-04T09:42:00.9157555Z cpp/cuda_atomic_ops_test 1/1 2025-12-04T09:42:00.9157949Z Name: excluded (est. time: 0.0min) 2025-12-04T09:42:00.9158193Z Serial tests (0): 2025-12-04T09:42:00.9158389Z Parallel tests (0): 2025-12-04T09:42:00.9158720Z Running cpp/cuda_atomic_ops_test 1/1 ... [2025-12-04 09:42:00.915622][2964.579443491] 2025-12-04T09:42:00.9159125Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:42:00.9159439Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T09:42:00.9159912Z Finished cpp/cuda_atomic_ops_test 1/1 ... [2025-12-04 09:42:00.915753][2964.579574833], took 0.00min 2025-12-04T09:42:02.1992603Z Uploading artifacts took 1.27 seconds 2025-12-04T09:42:05.1544288Z Running cpp/cuda_atomic_ops_test 1/1 ... [2025-12-04 09:42:05.153934][2968.817751204] 2025-12-04T09:42:05.1544865Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:42:05.1545249Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T09:42:05.1546077Z Finished cpp/cuda_atomic_ops_test 1/1 ... [2025-12-04 09:42:05.154112][2968.817932947], took 0.00min 2025-12-04T09:42:06.0575240Z Running test batch 'tests to run' cost 5.14 seconds 2025-12-04T09:42:06.6834894Z + run_if_exists cuda_allocator_test 2025-12-04T09:42:06.6835312Z + local test_name=cuda_allocator_test 2025-12-04T09:42:06.6835651Z + [[ -x build/bin/cuda_allocator_test ]] 2025-12-04T09:42:06.6836103Z + python test/run_test.py --cpp --verbose -i cpp/cuda_allocator_test 2025-12-04T09:42:11.0170019Z Downloading https://ossci-metrics.s3.amazonaws.com/disabled-tests-condensed.json to /var/lib/jenkins/workspace/test/.pytorch-disabled-tests.json 2025-12-04T09:42:11.0248053Z Found test times from artifacts 2025-12-04T09:42:11.0565119Z Found test times from artifacts 2025-12-04T09:42:11.0575263Z Running all tests 2025-12-04T09:42:11.0578469Z Running parallel tests on 3 processes 2025-12-04T09:42:11.0578981Z Name: tests to run (est. time: 0.0min) 2025-12-04T09:42:11.0579402Z Serial tests (0): 2025-12-04T09:42:11.0579641Z Parallel tests (1): 2025-12-04T09:42:11.0579881Z cpp/cuda_allocator_test 1/1 2025-12-04T09:42:11.0580167Z Name: excluded (est. time: 0.0min) 2025-12-04T09:42:11.0580486Z Serial tests (0): 2025-12-04T09:42:11.0580680Z Parallel tests (0): 2025-12-04T09:42:11.0580990Z Running cpp/cuda_allocator_test 1/1 ... [2025-12-04 09:42:11.057852][2974.721673098] 2025-12-04T09:42:11.0581567Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:42:11.0581874Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T09:42:11.0582319Z Finished cpp/cuda_allocator_test 1/1 ... [2025-12-04 09:42:11.057980][2974.7218027], took 0.00min 2025-12-04T09:42:12.8578117Z Uploading artifacts took 1.79 seconds 2025-12-04T09:42:15.8125386Z Running cpp/cuda_allocator_test 1/1 ... [2025-12-04 09:42:15.812025][2979.475842124] 2025-12-04T09:42:15.8125987Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:42:15.8126395Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T09:42:15.8127019Z Finished cpp/cuda_allocator_test 1/1 ... [2025-12-04 09:42:15.812198][2979.476018397], took 0.00min 2025-12-04T09:42:16.6908482Z Running test batch 'tests to run' cost 5.63 seconds 2025-12-04T09:42:17.3203707Z + '[' ON == ON ']' 2025-12-04T09:42:17.3204449Z + valgrind --suppressions=/var/lib/jenkins/workspace/aten/tools/valgrind.sup --error-exitcode=1 build/bin/basic '--gtest_filter=-*CUDA' 2025-12-04T09:42:17.3314814Z ==38641== Memcheck, a memory error detector 2025-12-04T09:42:17.3315256Z ==38641== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al. 2025-12-04T09:42:17.3315918Z ==38641== Using Valgrind-3.20.0 and LibVEX; rerun with -h for copyright info 2025-12-04T09:42:17.3316834Z ==38641== Command: build/bin/basic --gtest_filter=-*CUDA 2025-12-04T09:42:17.3317319Z ==38641== 2025-12-04T09:42:17.5375369Z ==38641== Warning: set address range perms: large range [0x4a9d000, 0x19b6a000) (defined) 2025-12-04T09:42:17.5376238Z ==38641== Warning: set address range perms: large range [0x5ab7000, 0x16c2e000) (defined) 2025-12-04T09:42:20.1642550Z ==38641== Warning: set address range perms: large range [0x19b6a000, 0x2ce73000) (defined) 2025-12-04T09:42:21.1995902Z ==38641== Warning: set address range perms: large range [0x2e75a000, 0x45c0f000) (noaccess) 2025-12-04T09:42:21.1996822Z ==38641== Warning: set address range perms: large range [0x2e800000, 0x45ab5000) (defined) 2025-12-04T09:42:21.2427115Z ==38641== Warning: set address range perms: large range [0x45ab5000, 0x56e7a000) (noaccess) 2025-12-04T09:42:21.2427991Z ==38641== Warning: set address range perms: large range [0x45c00000, 0x56dc5000) (defined) 2025-12-04T09:42:21.2730559Z ==38641== Warning: set address range perms: large range [0x46bbd000, 0x56dc5000) (defined) 2025-12-04T09:42:21.2844535Z ==38641== Warning: set address range perms: large range [0x59c92000, 0x73439000) (defined) 2025-12-04T09:42:21.2845426Z ==38641== Warning: set address range perms: large range [0x5ec47000, 0x73083000) (defined) 2025-12-04T09:42:22.0370702Z ==38641== Warning: set address range perms: large range [0x73439000, 0x89eab000) (defined) 2025-12-04T09:42:22.0371540Z ==38641== Warning: set address range perms: large range [0x735ef000, 0x89e54000) (defined) 2025-12-04T09:42:22.1542978Z ==38641== Warning: set address range perms: large range [0x99d0b000, 0xcc1fd000) (noaccess) 2025-12-04T09:42:22.1543900Z ==38641== Warning: set address range perms: large range [0x99e00000, 0xcc0f2000) (defined) 2025-12-04T09:43:06.3213939Z Running main() from /var/lib/jenkins/workspace/third_party/googletest/googletest/src/gtest_main.cc 2025-12-04T09:43:06.3445047Z Note: Google Test filter = -*CUDA 2025-12-04T09:43:06.3504415Z [==========] Running 6 tests from 1 test suite. 2025-12-04T09:43:06.3518019Z [----------] Global test environment set-up. 2025-12-04T09:43:06.3567326Z [----------] 6 tests from BasicTest 2025-12-04T09:43:06.3587559Z [ RUN ] BasicTest.BasicTestCPU 2025-12-04T09:43:06.5894300Z ==38641== Warning: noted but unhandled ioctl 0x30000001 with no size/direction hints. 2025-12-04T09:43:06.5894832Z ==38641== This could cause spurious value errors to appear. 2025-12-04T09:43:06.5895319Z ==38641== See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper. 2025-12-04T09:43:06.5899668Z ==38641== Warning: noted but unhandled ioctl 0x4b with no size/direction hints. 2025-12-04T09:43:06.5900392Z ==38641== This could cause spurious value errors to appear. 2025-12-04T09:43:06.5900890Z ==38641== See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper. 2025-12-04T09:43:06.5905854Z ==38641== Warning: noted but unhandled ioctl 0x27 with no size/direction hints. 2025-12-04T09:43:06.5906225Z ==38641== This could cause spurious value errors to appear. 2025-12-04T09:43:06.5906614Z ==38641== See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper. 2025-12-04T09:43:06.7328896Z ==38641== Warning: noted but unhandled ioctl 0x25 with no size/direction hints. 2025-12-04T09:43:06.7329393Z ==38641== This could cause spurious value errors to appear. 2025-12-04T09:43:06.7329867Z ==38641== See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper. 2025-12-04T09:43:06.8277151Z ==38641== Warning: noted but unhandled ioctl 0x46 with no size/direction hints. 2025-12-04T09:43:06.8277669Z ==38641== This could cause spurious value errors to appear. 2025-12-04T09:43:06.8278175Z ==38641== See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper. 2025-12-04T09:43:06.8285653Z ==38641== Warning: noted but unhandled ioctl 0x17 with no size/direction hints. 2025-12-04T09:43:06.8286136Z ==38641== This could cause spurious value errors to appear. 2025-12-04T09:43:06.8286625Z ==38641== See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper. 2025-12-04T09:43:06.8378713Z ==38641== Warning: set address range perms: large range [0x200000000, 0x300200000) (noaccess) 2025-12-04T09:43:08.5928009Z 1173 ms 2025-12-04T09:43:08.6666795Z 38 ms 2025-12-04T09:43:08.7232065Z 49 ms 2025-12-04T09:43:11.3305763Z [ OK ] BasicTest.BasicTestCPU (4970 ms) 2025-12-04T09:43:11.4271957Z [ RUN ] BasicTest.BasicTestHalfCPU 2025-12-04T09:43:11.9663292Z 450 ms 2025-12-04T09:43:12.0000860Z 29 ms 2025-12-04T09:43:12.1262374Z 48 ms 2025-12-04T09:43:12.3464323Z [ OK ] BasicTest.BasicTestHalfCPU (865 ms) 2025-12-04T09:43:12.3464761Z [ RUN ] BasicTest.FactoryMethodsTest 2025-12-04T09:43:12.4312748Z ==38641== Warning: noted but unhandled ioctl 0x19 with no size/direction hints. 2025-12-04T09:43:12.4313241Z ==38641== This could cause spurious value errors to appear. 2025-12-04T09:43:12.4313745Z ==38641== See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper. 2025-12-04T09:43:12.4582486Z ==38641== Warning: noted but unhandled ioctl 0x49 with no size/direction hints. 2025-12-04T09:43:12.4583014Z ==38641== This could cause spurious value errors to appear. 2025-12-04T09:43:12.4583865Z ==38641== See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper. 2025-12-04T09:43:12.4591446Z ==38641== Warning: noted but unhandled ioctl 0x21 with no size/direction hints. 2025-12-04T09:43:12.4591842Z ==38641== This could cause spurious value errors to appear. 2025-12-04T09:43:12.4592428Z ==38641== See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper. 2025-12-04T09:43:12.5831604Z ==38641== Warning: noted but unhandled ioctl 0x1b with no size/direction hints. 2025-12-04T09:43:12.5832092Z ==38641== This could cause spurious value errors to appear. 2025-12-04T09:43:12.5832581Z ==38641== See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper. 2025-12-04T09:43:13.5580628Z [ OK ] BasicTest.FactoryMethodsTest (1211 ms) 2025-12-04T09:43:13.5581039Z [ RUN ] BasicTest.BasicStdTestCPU 2025-12-04T09:43:13.6493876Z Simple example: called once 2025-12-04T09:43:13.6817960Z Didn't throw, call_once will not attempt again 2025-12-04T09:43:13.6844744Z [ OK ] BasicTest.BasicStdTestCPU (126 ms) 2025-12-04T09:43:13.6845247Z [ RUN ] BasicTest.TestForBlobResizeCPU 2025-12-04T09:43:13.7005406Z [ OK ] BasicTest.TestForBlobResizeCPU (15 ms) 2025-12-04T09:43:13.7005947Z [ RUN ] BasicTest.TestForBlobStridesResizeCPU 2025-12-04T09:43:13.7041492Z [ OK ] BasicTest.TestForBlobStridesResizeCPU (3 ms) 2025-12-04T09:43:13.7058713Z [----------] 6 tests from BasicTest (7346 ms total) 2025-12-04T09:43:13.7059113Z 2025-12-04T09:43:13.7070006Z [----------] Global test environment tear-down 2025-12-04T09:43:13.7090720Z [==========] 6 tests from 1 test suite ran. (7366 ms total) 2025-12-04T09:43:13.7102028Z [ PASSED ] 6 tests. 2025-12-04T09:43:15.9808676Z ==38641== 2025-12-04T09:43:15.9823756Z ==38641== HEAP SUMMARY: 2025-12-04T09:43:15.9824224Z ==38641== in use at exit: 22,414,310 bytes in 24,816 blocks 2025-12-04T09:43:15.9824736Z ==38641== total heap usage: 1,059,881 allocs, 1,035,065 frees, 280,357,338 bytes allocated 2025-12-04T09:43:15.9825122Z ==38641== 2025-12-04T09:43:17.4822507Z ==38641== LEAK SUMMARY: 2025-12-04T09:43:17.4822945Z ==38641== definitely lost: 288 bytes in 3 blocks 2025-12-04T09:43:17.4823299Z ==38641== indirectly lost: 192 bytes in 2 blocks 2025-12-04T09:43:17.4823657Z ==38641== possibly lost: 97,200 bytes in 186 blocks 2025-12-04T09:43:17.4824014Z ==38641== still reachable: 22,316,630 bytes in 24,625 blocks 2025-12-04T09:43:17.4824368Z ==38641== suppressed: 0 bytes in 0 blocks 2025-12-04T09:43:17.4824758Z ==38641== Rerun with --leak-check=full to see details of leaked memory 2025-12-04T09:43:17.4825098Z ==38641== 2025-12-04T09:43:17.4825375Z ==38641== For lists of detected and suppressed errors, rerun with: -s 2025-12-04T09:43:17.4825830Z ==38641== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 4 from 4) 2025-12-04T09:43:17.6853223Z + [[ -x build/bin/tensor_interop_test ]] 2025-12-04T09:43:17.6854831Z + [[ -n '' ]] 2025-12-04T09:43:17.6855049Z + assert_git_not_dirty 2025-12-04T09:43:17.6855412Z + [[ linux-jammy-cuda12.8-py3.10-gcc11 != *rocm* ]] 2025-12-04T09:43:17.6855870Z + [[ linux-jammy-cuda12.8-py3.10-gcc11 != *xla* ]] 2025-12-04T09:43:17.6862312Z ++ git status --porcelain 2025-12-04T09:43:17.6863949Z ++ grep -v '?? third_party' 2025-12-04T09:43:18.0184403Z ++ true 2025-12-04T09:43:18.0187861Z + git_status= 2025-12-04T09:43:18.0188469Z + [[ -n '' ]] 2025-12-04T09:43:18.0188967Z + test_libtorch 1 2025-12-04T09:43:18.0189273Z + local SHARD=1 2025-12-04T09:43:18.0189464Z + [[ default != \s\l\o\w ]] 2025-12-04T09:43:18.0189672Z + echo 'Testing libtorch' 2025-12-04T09:43:18.0189921Z Testing libtorch 2025-12-04T09:43:18.0190532Z + ln -sf /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libbackend_with_compiler.so /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/bin 2025-12-04T09:43:18.0203833Z + ln -sf /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libjitbackend_test.so /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/bin 2025-12-04T09:43:18.0216421Z + ln -sf /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libcaffe2_nvrtc.so /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/bin 2025-12-04T09:43:18.0229049Z + ln -sf /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10_cuda.so /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10d_cuda_test.so /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/bin 2025-12-04T09:43:18.0243467Z + ln -sf /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libshm /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libshm.so /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libshm_windows /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/bin 2025-12-04T09:43:18.0258865Z + ln -sf /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch.so /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda_linalg.so /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_global_deps.so /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_nvshmem.so /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_python.so /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorchbind_test.so /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/bin 2025-12-04T09:43:18.0270946Z + ln -sf '/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libnvfuser*' /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/bin 2025-12-04T09:43:18.0283213Z + export CPP_TESTS_DIR=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/bin 2025-12-04T09:43:18.0283927Z + CPP_TESTS_DIR=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/bin 2025-12-04T09:43:18.0284300Z + [[ -z 1 ]] 2025-12-04T09:43:18.0284480Z + [[ 1 == \1 ]] 2025-12-04T09:43:18.0284686Z + test_libtorch_api 2025-12-04T09:43:18.0284961Z + MNIST_DIR=/var/lib/jenkins/workspace/test/cpp/api/mnist 2025-12-04T09:43:18.0285456Z + python tools/download_mnist.py --quiet -d /var/lib/jenkins/workspace/test/cpp/api/mnist 2025-12-04T09:43:18.0695412Z Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz ... 2025-12-04T09:43:18.4043178Z Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz ... 2025-12-04T09:43:18.4806919Z Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz ... 2025-12-04T09:43:18.5812735Z Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz ... 2025-12-04T09:43:18.6223071Z + [[ linux-jammy-cuda12.8-py3.10-gcc11 == *asan* ]] 2025-12-04T09:43:18.6223675Z + [[ linux-jammy-cuda12.8-py3.10-gcc11 == *slow-gradcheck* ]] 2025-12-04T09:43:18.6224058Z + OMP_NUM_THREADS=2 2025-12-04T09:43:18.6224413Z + TORCH_CPP_TEST_MNIST_PATH=/var/lib/jenkins/workspace/test/cpp/api/mnist 2025-12-04T09:43:22.8380572Z + python test/run_test.py --cpp --verbose -i cpp/test_api -k 'not IMethodTest' 2025-12-04T09:43:22.8381486Z Downloading https://ossci-metrics.s3.amazonaws.com/disabled-tests-condensed.json to /var/lib/jenkins/workspace/test/.pytorch-disabled-tests.json 2025-12-04T09:43:22.8458508Z Found test times from artifacts 2025-12-04T09:43:22.8773505Z Found test times from artifacts 2025-12-04T09:43:22.8783215Z Running all tests 2025-12-04T09:43:22.8786580Z Running parallel tests on 3 processes 2025-12-04T09:43:22.8787005Z Name: tests to run (est. time: 0.0min) 2025-12-04T09:43:22.8787274Z Serial tests (0): 2025-12-04T09:43:22.8787485Z Parallel tests (1): 2025-12-04T09:43:22.8787710Z cpp/test_api 1/1 2025-12-04T09:43:22.8787931Z Name: excluded (est. time: 0.0min) 2025-12-04T09:43:22.8788179Z Serial tests (0): 2025-12-04T09:43:22.8788393Z Parallel tests (0): 2025-12-04T09:43:22.8792017Z Running cpp/test_api 1/1 ... [2025-12-04 09:43:22.879039][3046.542858318] 2025-12-04T09:43:22.8792612Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:43:22.8792995Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T09:43:22.8793424Z Finished cpp/test_api 1/1 ... [2025-12-04 09:43:22.879172][3046.54299442], took 0.00min 2025-12-04T09:43:24.4497358Z Uploading artifacts took 1.56 seconds 2025-12-04T09:43:27.3204666Z Running cpp/test_api 1/1 ... [2025-12-04 09:43:27.319977][3050.983790758] 2025-12-04T09:43:27.3205107Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:43:27.3205460Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T09:43:27.3205956Z Finished cpp/test_api 1/1 ... [2025-12-04 09:43:27.320192][3050.984010832], took 0.00min 2025-12-04T09:43:28.1963663Z Running test batch 'tests to run' cost 5.32 seconds 2025-12-04T09:43:28.8016686Z + [[ linux-jammy-cuda12.8-py3.10-gcc11 != *android* ]] 2025-12-04T09:43:28.8017436Z + [[ linux-jammy-cuda12.8-py3.10-gcc11 != *cuda* ]] 2025-12-04T09:43:28.8017751Z + [[ -z 1 ]] 2025-12-04T09:43:28.8017952Z + [[ 1 == \2 ]] 2025-12-04T09:43:28.8018152Z + assert_git_not_dirty 2025-12-04T09:43:28.8018415Z + [[ linux-jammy-cuda12.8-py3.10-gcc11 != *rocm* ]] 2025-12-04T09:43:28.8018765Z + [[ linux-jammy-cuda12.8-py3.10-gcc11 != *xla* ]] 2025-12-04T09:43:28.8023603Z ++ git status --porcelain 2025-12-04T09:43:28.8024003Z ++ grep -v '?? third_party' 2025-12-04T09:43:29.1357594Z ++ true 2025-12-04T09:43:29.1358749Z + git_status= 2025-12-04T09:43:29.1359023Z + [[ -n '' ]] 2025-12-04T09:43:29.1359789Z + [[ linux-jammy-cuda12.8-py3.10-gcc11 == *xpu* ]] 2025-12-04T09:43:29.1360849Z + sccache_epilogue 2025-12-04T09:43:29.1361135Z + echo '::group::Sccache Compilation Log' 2025-12-04T09:43:29.1362130Z ##[group]Sccache Compilation Log 2025-12-04T09:43:29.1362411Z + echo '=================== sccache compilation log ===================' 2025-12-04T09:43:29.1362720Z =================== sccache compilation log =================== 2025-12-04T09:43:29.1363178Z + python /var/lib/jenkins/workspace/.ci/pytorch/print_sccache_log.py /var/lib/jenkins/sccache_error.log 2025-12-04T09:43:29.1486331Z + echo '=========== If your build fails, please take a look at the log above for possible reasons ===========' 2025-12-04T09:43:29.1486963Z =========== If your build fails, please take a look at the log above for possible reasons =========== 2025-12-04T09:43:29.1487386Z + sccache --show-stats 2025-12-04T09:43:29.1510753Z Compile requests 238 2025-12-04T09:43:29.1511070Z Compile requests executed 118 2025-12-04T09:43:29.1511346Z Cache hits 74 2025-12-04T09:43:29.1511630Z Cache hits (C/C++) 74 2025-12-04T09:43:29.1511887Z Cache misses 44 2025-12-04T09:43:29.1512143Z Cache misses (C/C++) 44 2025-12-04T09:43:29.1512415Z Cache hits rate 62.71 % 2025-12-04T09:43:29.1512687Z Cache hits rate (C/C++) 62.71 % 2025-12-04T09:43:29.1512947Z Cache timeouts 0 2025-12-04T09:43:29.1513214Z Cache read errors 0 2025-12-04T09:43:29.1513464Z Forced recaches 0 2025-12-04T09:43:29.1513741Z Cache write errors 0 2025-12-04T09:43:29.1514124Z Cache errors 0 2025-12-04T09:43:29.1514382Z Compilations 44 2025-12-04T09:43:29.1514639Z Compilation failures 0 2025-12-04T09:43:29.1514926Z Non-cacheable compilations 0 2025-12-04T09:43:29.1515204Z Non-cacheable calls 0 2025-12-04T09:43:29.1515465Z Non-compilation calls 120 2025-12-04T09:43:29.1515733Z Unsupported compiler calls 0 2025-12-04T09:43:29.1516020Z Average cache write 0.039 s 2025-12-04T09:43:29.1516301Z Average compiler 11.721 s 2025-12-04T09:43:29.1516573Z Average cache read hit 0.059 s 2025-12-04T09:43:29.1516856Z Failed distributed compilations 0 2025-12-04T09:43:29.1517431Z Cache location s3, name: ossci-compiler-cache-circleci-v2, prefix: / 2025-12-04T09:43:29.1517849Z Version (client) 0.10.0 2025-12-04T09:43:29.1518119Z + sccache --stop-server 2025-12-04T09:43:29.1532520Z Stopping sccache server... 2025-12-04T09:43:29.1535719Z Compile requests 238 2025-12-04T09:43:29.1536017Z Compile requests executed 118 2025-12-04T09:43:29.1536303Z Cache hits 74 2025-12-04T09:43:29.1536555Z Cache hits (C/C++) 74 2025-12-04T09:43:29.1536810Z Cache misses 44 2025-12-04T09:43:29.1537065Z Cache misses (C/C++) 44 2025-12-04T09:43:29.1537322Z Cache hits rate 62.71 % 2025-12-04T09:43:29.1537584Z Cache hits rate (C/C++) 62.71 % 2025-12-04T09:43:29.1537848Z Cache timeouts 0 2025-12-04T09:43:29.1538250Z Cache read errors 0 2025-12-04T09:43:29.1538501Z Forced recaches 0 2025-12-04T09:43:29.1538762Z Cache write errors 0 2025-12-04T09:43:29.1539014Z Cache errors 0 2025-12-04T09:43:29.1539268Z Compilations 44 2025-12-04T09:43:29.1539739Z Compilation failures 0 2025-12-04T09:43:29.1540163Z Non-cacheable compilations 0 2025-12-04T09:43:29.1540528Z Non-cacheable calls 0 2025-12-04T09:43:29.1540982Z Non-compilation calls 120 2025-12-04T09:43:29.1541276Z Unsupported compiler calls 0 2025-12-04T09:43:29.1541552Z Average cache write 0.039 s 2025-12-04T09:43:29.1541790Z Average compiler 11.721 s 2025-12-04T09:43:29.1542005Z Average cache read hit 0.059 s 2025-12-04T09:43:29.1542230Z Failed distributed compilations 0 2025-12-04T09:43:29.1542541Z Cache location s3, name: ossci-compiler-cache-circleci-v2, prefix: / 2025-12-04T09:43:29.1542872Z Version (client) 0.10.0 2025-12-04T09:43:29.1543102Z + echo ::endgroup:: 2025-12-04T09:43:29.1543457Z ##[endgroup] 2025-12-04T09:43:29.1543616Z + cleanup_workspace 2025-12-04T09:43:29.1543956Z + echo 'sudo may print the following warning message that can be ignored. The chown command will still run.' 2025-12-04T09:43:29.1544499Z sudo may print the following warning message that can be ignored. The chown command will still run. 2025-12-04T09:43:29.1544937Z + echo ' sudo: setrlimit(RLIMIT_STACK): Operation not permitted' 2025-12-04T09:43:29.1545260Z sudo: setrlimit(RLIMIT_STACK): Operation not permitted 2025-12-04T09:43:29.1545666Z + echo 'For more details refer to https://github.com/sudo-project/sudo/issues/42' 2025-12-04T09:43:29.1546092Z For more details refer to https://github.com/sudo-project/sudo/issues/42 2025-12-04T09:43:29.1546428Z + sudo chown -R 1000 /var/lib/jenkins/workspace 2025-12-04T09:43:30.1432595Z ##[group]Run pytorch/test-infra/.github/actions/upload-benchmark-results@main 2025-12-04T09:43:30.1432978Z with: 2025-12-04T09:43:30.1433173Z benchmark-results-dir: test/test-reports 2025-12-04T09:43:30.1433448Z dry-run: false 2025-12-04T09:43:30.1433729Z schema-version: v3 2025-12-04T09:43:30.1434169Z github-token: *** 2025-12-04T09:43:30.1434359Z env: 2025-12-04T09:43:30.1434505Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:43:30.1434701Z HAS_NVIDIA_GPU: true 2025-12-04T09:43:30.1434928Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:43:30.1435340Z DOCKER_CONTAINER_ID: c2ae7847e2aafd5d52f3ce4ca7b43f721da17c1dd22c8b80be2b080cb30eafd7 2025-12-04T09:43:30.1435692Z ##[endgroup] 2025-12-04T09:43:30.1450809Z ##[group]Run set -eux 2025-12-04T09:43:30.1451014Z set -eux 2025-12-04T09:43:30.1451170Z  2025-12-04T09:43:30.1451319Z if [[ -n "" ]]; then 2025-12-04T09:43:30.1451515Z  source "" 2025-12-04T09:43:30.1451719Z fi 2025-12-04T09:43:30.1451977Z python3 -mpip install boto3==1.35.33 psutil==7.0.0 pynvml==12.0.0 2025-12-04T09:43:30.1452363Z  2025-12-04T09:43:30.1452521Z DEVICE_NAME="" 2025-12-04T09:43:30.1452704Z DEVICE_TYPE="" 2025-12-04T09:43:30.1452876Z  2025-12-04T09:43:30.1453065Z if command -v nvidia-smi; then 2025-12-04T09:43:30.1453653Z  # NB: I'm using PyTorch here to get the device name, however, it needs to 2025-12-04T09:43:30.1454074Z  # install the correct version of PyTorch manually for now. Any PyTorch 2025-12-04T09:43:30.1454455Z  # version is fine, I just use 2.7.1 to satify PYPIDEP linter 2025-12-04T09:43:30.1454787Z  python3 -mpip install torch==2.7.1 2025-12-04T09:43:30.1455029Z elif command -v rocminfo; then 2025-12-04T09:43:30.1455337Z  # NB: Installing torch on ROCm runner with pip here causes CI to fail 2025-12-04T09:43:30.1455732Z  # with a memoryview is too large error only on MI300 runners. Is pip 2025-12-04T09:43:30.1456236Z  # version on ROCm runner there too old? As a workaround, let's use the 2025-12-04T09:43:30.1456582Z  # GPU device name coming from rocminfo instead 2025-12-04T09:43:30.1456841Z  DEVICE_NAME=rocm 2025-12-04T09:43:30.1457184Z  DEVICE_TYPE=$(rocminfo | grep "Marketing Name" | tail -n1 | awk -F':' '{print $2}' | xargs) 2025-12-04T09:43:30.1457531Z fi 2025-12-04T09:43:30.1457681Z  2025-12-04T09:43:30.1457877Z echo "DEVICE_NAME=$DEVICE_NAME" >> $GITHUB_ENV 2025-12-04T09:43:30.1458163Z echo "DEVICE_TYPE=$DEVICE_TYPE" >> $GITHUB_ENV 2025-12-04T09:43:30.1470714Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:43:30.1470997Z env: 2025-12-04T09:43:30.1471168Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:43:30.1471357Z HAS_NVIDIA_GPU: true 2025-12-04T09:43:30.1471592Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:43:30.1472012Z DOCKER_CONTAINER_ID: c2ae7847e2aafd5d52f3ce4ca7b43f721da17c1dd22c8b80be2b080cb30eafd7 2025-12-04T09:43:30.1472366Z ##[endgroup] 2025-12-04T09:43:30.1508853Z + [[ -n '' ]] 2025-12-04T09:43:30.1509193Z + python3 -mpip install boto3==1.35.33 psutil==7.0.0 pynvml==12.0.0 2025-12-04T09:43:30.3819746Z Defaulting to user installation because normal site-packages is not writeable 2025-12-04T09:43:31.4342895Z Collecting boto3==1.35.33 2025-12-04T09:43:31.4514340Z Downloading boto3-1.35.33-py3-none-any.whl (139 kB) 2025-12-04T09:43:31.7440620Z Collecting psutil==7.0.0 2025-12-04T09:43:31.7477692Z Downloading psutil-7.0.0-cp36-abi3-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (277 kB) 2025-12-04T09:43:31.7776278Z Collecting pynvml==12.0.0 2025-12-04T09:43:31.7813414Z Downloading pynvml-12.0.0-py3-none-any.whl (26 kB) 2025-12-04T09:43:31.8244364Z Collecting s3transfer<0.11.0,>=0.10.0 2025-12-04T09:43:31.8276742Z Downloading s3transfer-0.10.4-py3-none-any.whl (83 kB) 2025-12-04T09:43:31.8325234Z Requirement already satisfied: jmespath<2.0.0,>=0.7.1 in /usr/lib/python3.9/site-packages (from boto3==1.35.33) (0.10.0) 2025-12-04T09:43:32.8925318Z Collecting botocore<1.36.0,>=1.35.33 2025-12-04T09:43:32.8964000Z Downloading botocore-1.35.99-py3-none-any.whl (13.3 MB) 2025-12-04T09:43:33.1118206Z Collecting nvidia-ml-py<13.0.0a0,>=12.0.0 2025-12-04T09:43:33.1154529Z Downloading nvidia_ml_py-12.575.51-py3-none-any.whl (47 kB) 2025-12-04T09:43:33.1233014Z Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /usr/lib/python3.9/site-packages (from botocore<1.36.0,>=1.35.33->boto3==1.35.33) (2.8.1) 2025-12-04T09:43:33.1242502Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in /usr/lib/python3.9/site-packages (from botocore<1.36.0,>=1.35.33->boto3==1.35.33) (1.25.10) 2025-12-04T09:43:33.2521488Z Requirement already satisfied: six>=1.5 in /usr/lib/python3.9/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.36.0,>=1.35.33->boto3==1.35.33) (1.15.0) 2025-12-04T09:43:33.3689416Z Installing collected packages: botocore, s3transfer, nvidia-ml-py, pynvml, psutil, boto3 2025-12-04T09:43:33.8775159Z Attempting uninstall: nvidia-ml-py 2025-12-04T09:43:33.8776882Z Found existing installation: nvidia-ml-py 11.525.84 2025-12-04T09:43:33.8790592Z Uninstalling nvidia-ml-py-11.525.84: 2025-12-04T09:43:33.8980475Z Successfully uninstalled nvidia-ml-py-11.525.84 2025-12-04T09:43:33.9493295Z Attempting uninstall: psutil 2025-12-04T09:43:33.9494263Z Found existing installation: psutil 5.9.8 2025-12-04T09:43:33.9572112Z Uninstalling psutil-5.9.8: 2025-12-04T09:43:33.9579361Z Successfully uninstalled psutil-5.9.8 2025-12-04T09:43:34.1053451Z Successfully installed boto3-1.35.33 botocore-1.35.99 nvidia-ml-py-12.575.51 psutil-7.0.0 pynvml-12.0.0 s3transfer-0.10.4 2025-12-04T09:43:34.1988160Z + DEVICE_NAME= 2025-12-04T09:43:34.1988451Z + DEVICE_TYPE= 2025-12-04T09:43:34.1988686Z + command -v nvidia-smi 2025-12-04T09:43:34.1989331Z + python3 -mpip install torch==2.7.1 2025-12-04T09:43:34.1989632Z /usr/bin/nvidia-smi 2025-12-04T09:43:34.4078841Z Defaulting to user installation because normal site-packages is not writeable 2025-12-04T09:43:34.6599832Z Collecting torch==2.7.1 2025-12-04T09:43:34.6752625Z Downloading torch-2.7.1-cp39-cp39-manylinux_2_28_x86_64.whl (821.1 MB) 2025-12-04T09:43:46.7242999Z Collecting nvidia-nccl-cu12==2.26.2 2025-12-04T09:43:46.7279749Z Downloading nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (201.3 MB) 2025-12-04T09:43:48.9627568Z Collecting nvidia-nvtx-cu12==12.6.77 2025-12-04T09:43:48.9658103Z Downloading nvidia_nvtx_cu12-12.6.77-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (89 kB) 2025-12-04T09:43:48.9963487Z Collecting nvidia-curand-cu12==10.3.7.77 2025-12-04T09:43:49.0000150Z Downloading nvidia_curand_cu12-10.3.7.77-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (56.3 MB) 2025-12-04T09:43:49.6224003Z Collecting nvidia-cuda-runtime-cu12==12.6.77 2025-12-04T09:43:49.6257693Z Downloading nvidia_cuda_runtime_cu12-12.6.77-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (897 kB) 2025-12-04T09:43:49.6524622Z Collecting nvidia-cufile-cu12==1.11.1.6 2025-12-04T09:43:49.6556666Z Downloading nvidia_cufile_cu12-1.11.1.6-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (1.1 MB) 2025-12-04T09:43:49.6709755Z Requirement already satisfied: typing-extensions>=4.10.0 in /home/ec2-user/.local/lib/python3.9/site-packages (from torch==2.7.1) (4.15.0) 2025-12-04T09:43:49.6979446Z Collecting nvidia-cuda-nvrtc-cu12==12.6.77 2025-12-04T09:43:49.7015835Z Downloading nvidia_cuda_nvrtc_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl (23.7 MB) 2025-12-04T09:43:49.9649054Z Collecting filelock 2025-12-04T09:43:49.9681362Z Downloading filelock-3.19.1-py3-none-any.whl (15 kB) 2025-12-04T09:43:49.9975416Z Collecting nvidia-cusolver-cu12==11.7.1.2 2025-12-04T09:43:50.0007193Z Downloading nvidia_cusolver_cu12-11.7.1.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (158.2 MB) 2025-12-04T09:43:51.6345298Z Collecting nvidia-nvjitlink-cu12==12.6.85 2025-12-04T09:43:51.6378825Z Downloading nvidia_nvjitlink_cu12-12.6.85-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (19.7 MB) 2025-12-04T09:43:51.8675706Z Collecting networkx 2025-12-04T09:43:51.8709272Z Downloading networkx-3.2.1-py3-none-any.whl (1.6 MB) 2025-12-04T09:43:51.9281944Z Collecting sympy>=1.13.3 2025-12-04T09:43:51.9311581Z Downloading sympy-1.14.0-py3-none-any.whl (6.3 MB) 2025-12-04T09:43:52.0194453Z Collecting nvidia-cufft-cu12==11.3.0.4 2025-12-04T09:43:52.0217795Z Downloading nvidia_cufft_cu12-11.3.0.4-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (200.2 MB) 2025-12-04T09:43:54.3497570Z Collecting nvidia-cublas-cu12==12.6.4.1 2025-12-04T09:43:54.3561219Z Downloading nvidia_cublas_cu12-12.6.4.1-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (393.1 MB) 2025-12-04T09:43:59.7588659Z Collecting nvidia-cusparse-cu12==12.5.4.2 2025-12-04T09:43:59.7624478Z Downloading nvidia_cusparse_cu12-12.5.4.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (216.6 MB) 2025-12-04T09:44:02.2024312Z Requirement already satisfied: jinja2 in /usr/lib/python3.9/site-packages (from torch==2.7.1) (2.11.3) 2025-12-04T09:44:02.2190733Z Collecting nvidia-cusparselt-cu12==0.6.3 2025-12-04T09:44:02.2240675Z Downloading nvidia_cusparselt_cu12-0.6.3-py3-none-manylinux2014_x86_64.whl (156.8 MB) 2025-12-04T09:44:03.9770937Z Collecting triton==3.3.1 2025-12-04T09:44:03.9824128Z Downloading triton-3.3.1-cp39-cp39-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (155.6 MB) 2025-12-04T09:44:05.5126193Z Collecting nvidia-cudnn-cu12==9.5.1.17 2025-12-04T09:44:05.5158731Z Downloading nvidia_cudnn_cu12-9.5.1.17-py3-none-manylinux_2_28_x86_64.whl (571.0 MB) 2025-12-04T09:44:13.5899510Z Collecting nvidia-cuda-cupti-cu12==12.6.80 2025-12-04T09:44:13.5962785Z Downloading nvidia_cuda_cupti_cu12-12.6.80-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (8.9 MB) 2025-12-04T09:44:13.7430678Z Collecting fsspec 2025-12-04T09:44:13.7459315Z Downloading fsspec-2025.10.0-py3-none-any.whl (200 kB) 2025-12-04T09:44:13.7745899Z Requirement already satisfied: setuptools>=40.8.0 in /usr/lib/python3.9/site-packages (from triton==3.3.1->torch==2.7.1) (59.6.0) 2025-12-04T09:44:13.8017455Z Collecting mpmath<1.4,>=1.1.0 2025-12-04T09:44:13.8050356Z Downloading mpmath-1.3.0-py3-none-any.whl (536 kB) 2025-12-04T09:44:13.8824932Z Requirement already satisfied: MarkupSafe>=0.23 in /usr/lib64/python3.9/site-packages (from jinja2->torch==2.7.1) (1.1.1) 2025-12-04T09:44:14.1772016Z Installing collected packages: nvidia-nvjitlink-cu12, nvidia-cusparse-cu12, nvidia-cublas-cu12, mpmath, triton, sympy, nvidia-nvtx-cu12, nvidia-nccl-cu12, nvidia-cusparselt-cu12, nvidia-cusolver-cu12, nvidia-curand-cu12, nvidia-cufile-cu12, nvidia-cufft-cu12, nvidia-cudnn-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, networkx, fsspec, filelock, torch 2025-12-04T09:44:21.8626420Z WARNING: The scripts proton and proton-viewer are installed in '/home/ec2-user/.local/bin' which is not on PATH. 2025-12-04T09:44:21.8627233Z Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location. 2025-12-04T09:44:25.2765018Z WARNING: The script isympy is installed in '/home/ec2-user/.local/bin' which is not on PATH. 2025-12-04T09:44:25.2765844Z Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location. 2025-12-04T09:44:51.6139581Z WARNING: The scripts torchfrtrace and torchrun are installed in '/home/ec2-user/.local/bin' which is not on PATH. 2025-12-04T09:44:51.6140422Z Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location. 2025-12-04T09:44:51.7699378Z Successfully installed filelock-3.19.1 fsspec-2025.10.0 mpmath-1.3.0 networkx-3.2.1 nvidia-cublas-cu12-12.6.4.1 nvidia-cuda-cupti-cu12-12.6.80 nvidia-cuda-nvrtc-cu12-12.6.77 nvidia-cuda-runtime-cu12-12.6.77 nvidia-cudnn-cu12-9.5.1.17 nvidia-cufft-cu12-11.3.0.4 nvidia-cufile-cu12-1.11.1.6 nvidia-curand-cu12-10.3.7.77 nvidia-cusolver-cu12-11.7.1.2 nvidia-cusparse-cu12-12.5.4.2 nvidia-cusparselt-cu12-0.6.3 nvidia-nccl-cu12-2.26.2 nvidia-nvjitlink-cu12-12.6.85 nvidia-nvtx-cu12-12.6.77 sympy-1.14.0 torch-2.7.1 triton-3.3.1 2025-12-04T09:44:52.2742777Z + echo DEVICE_NAME= 2025-12-04T09:44:52.2743080Z + echo DEVICE_TYPE= 2025-12-04T09:44:52.2768488Z ##[group]Run set -eux 2025-12-04T09:44:52.2768689Z set -eux 2025-12-04T09:44:52.2768846Z  2025-12-04T09:44:52.2769013Z if [[ -z "${GITHUB_TOKEN}" ]]; then 2025-12-04T09:44:52.2769279Z  echo "Missing github-token input" 2025-12-04T09:44:52.2769504Z  exit 1 2025-12-04T09:44:52.2769665Z fi 2025-12-04T09:44:52.2778985Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:44:52.2779268Z env: 2025-12-04T09:44:52.2779424Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:44:52.2779641Z HAS_NVIDIA_GPU: true 2025-12-04T09:44:52.2779871Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:44:52.2780276Z DOCKER_CONTAINER_ID: c2ae7847e2aafd5d52f3ce4ca7b43f721da17c1dd22c8b80be2b080cb30eafd7 2025-12-04T09:44:52.2780649Z DEVICE_NAME: 2025-12-04T09:44:52.2780809Z DEVICE_TYPE: 2025-12-04T09:44:52.2781160Z GITHUB_TOKEN: *** 2025-12-04T09:44:52.2781471Z ##[endgroup] 2025-12-04T09:44:52.2809894Z + [[ -z *** ]] 2025-12-04T09:44:52.2891147Z ##[group]Run pytorch/test-infra/.github/actions/get-workflow-job-id@main 2025-12-04T09:44:52.2891449Z with: 2025-12-04T09:44:52.2891711Z github-token: *** 2025-12-04T09:44:52.2891891Z env: 2025-12-04T09:44:52.2892039Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:44:52.2892231Z HAS_NVIDIA_GPU: true 2025-12-04T09:44:52.2892551Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:44:52.2892959Z DOCKER_CONTAINER_ID: c2ae7847e2aafd5d52f3ce4ca7b43f721da17c1dd22c8b80be2b080cb30eafd7 2025-12-04T09:44:52.2893472Z DEVICE_NAME: 2025-12-04T09:44:52.2893640Z DEVICE_TYPE: 2025-12-04T09:44:52.2893789Z ##[endgroup] 2025-12-04T09:44:52.2976434Z ##[group]Run set -eux 2025-12-04T09:44:52.2976620Z set -eux 2025-12-04T09:44:52.2976784Z  2025-12-04T09:44:52.2977130Z python3 "${GITHUB_ACTION_PATH}/../../scripts/get_workflow_job_id.py" "${GITHUB_RUN_ID}" "${RUNNER_NAME}" 2025-12-04T09:44:52.2984255Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:44:52.2984523Z env: 2025-12-04T09:44:52.2984689Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:44:52.2984881Z HAS_NVIDIA_GPU: true 2025-12-04T09:44:52.2985101Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:44:52.2985509Z DOCKER_CONTAINER_ID: c2ae7847e2aafd5d52f3ce4ca7b43f721da17c1dd22c8b80be2b080cb30eafd7 2025-12-04T09:44:52.2985869Z DEVICE_NAME: 2025-12-04T09:44:52.2986031Z DEVICE_TYPE: 2025-12-04T09:44:52.2986303Z GITHUB_TOKEN: *** 2025-12-04T09:44:52.2986477Z ##[endgroup] 2025-12-04T09:44:52.3011996Z + python3 /home/ec2-user/actions-runner/_work/_actions/pytorch/test-infra/main/.github/actions/get-workflow-job-id/../../scripts/get_workflow_job_id.py 19922768520 i-004b897e3464f6ebc 2025-12-04T09:44:54.1062510Z setting job-id=57116084853 2025-12-04T09:44:54.1063205Z setting job-name=linux-jammy-cuda12.8-py3.10-gcc11 / test (default, 1, 5, lf.linux.g6.4xlarge.experimental.nvidia.gpu, rerun_disabled_tests) 2025-12-04T09:44:54.1163550Z ##[group]Run set -eux 2025-12-04T09:44:54.1163752Z set -eux 2025-12-04T09:44:54.1163900Z  2025-12-04T09:44:54.1164057Z if [[ -n "" ]]; then 2025-12-04T09:44:54.1164246Z  source "" 2025-12-04T09:44:54.1164402Z fi 2025-12-04T09:44:54.1164551Z  2025-12-04T09:44:54.1164834Z python3 "${GITHUB_ACTION_PATH}/../../scripts/benchmarks/gather_metadata.py" \ 2025-12-04T09:44:54.1165213Z  --schema-version "${SCHEMA_VERSION}" \ 2025-12-04T09:44:54.1165458Z  --repo "${REPO}" \ 2025-12-04T09:44:54.1165861Z  --head-branch "${HEAD_BRANCH}" \ 2025-12-04T09:44:54.1166111Z  --head-sha "${HEAD_SHA}" \ 2025-12-04T09:44:54.1166347Z  --workflow-id "${WORKFLOW_RUN_ID}" \ 2025-12-04T09:44:54.1166596Z  --run-attempt "${RUN_ATTEMPT}" \ 2025-12-04T09:44:54.1166820Z  --job-id "${JOB_ID}" \ 2025-12-04T09:44:54.1167038Z  --job-name "${JOB_NAME}" 2025-12-04T09:44:54.1174591Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:44:54.1174879Z env: 2025-12-04T09:44:54.1175033Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:44:54.1175213Z HAS_NVIDIA_GPU: true 2025-12-04T09:44:54.1175435Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:44:54.1175836Z DOCKER_CONTAINER_ID: c2ae7847e2aafd5d52f3ce4ca7b43f721da17c1dd22c8b80be2b080cb30eafd7 2025-12-04T09:44:54.1176207Z DEVICE_NAME: 2025-12-04T09:44:54.1176364Z DEVICE_TYPE: 2025-12-04T09:44:54.1176533Z SCHEMA_VERSION: v3 2025-12-04T09:44:54.1176735Z REPO: pytorch/pytorch 2025-12-04T09:44:54.1176916Z HEAD_BRANCH: refs/heads/main 2025-12-04T09:44:54.1177161Z HEAD_SHA: ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T09:44:54.1177411Z WORKFLOW_RUN_ID: 19922768520 2025-12-04T09:44:54.1177587Z RUN_ATTEMPT: 1 2025-12-04T09:44:54.1177740Z JOB_ID: 57116084853 2025-12-04T09:44:54.1178169Z JOB_NAME: linux-jammy-cuda12.8-py3.10-gcc11 / test (default, 1, 5, lf.linux.g6.4xlarge.experimental.nvidia.gpu, rerun_disabled_tests) 2025-12-04T09:44:54.1178749Z ##[endgroup] 2025-12-04T09:44:54.1203745Z + [[ -n '' ]] 2025-12-04T09:44:54.1205859Z + python3 /home/ec2-user/actions-runner/_work/_actions/pytorch/test-infra/main/.github/actions/upload-benchmark-results/../../scripts/benchmarks/gather_metadata.py --schema-version v3 --repo pytorch/pytorch --head-branch refs/heads/main --head-sha ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 --workflow-id 19922768520 --run-attempt 1 --job-id 57116084853 --job-name 'linux-jammy-cuda12.8-py3.10-gcc11 / test (default, 1, 5, lf.linux.g6.4xlarge.experimental.nvidia.gpu, rerun_disabled_tests)' 2025-12-04T09:44:54.1570596Z ##[group]Run set -eux 2025-12-04T09:44:54.1570790Z set -eux 2025-12-04T09:44:54.1570946Z  2025-12-04T09:44:54.1571091Z if [[ -n "" ]]; then 2025-12-04T09:44:54.1571289Z  source "" 2025-12-04T09:44:54.1571462Z fi 2025-12-04T09:44:54.1571613Z  2025-12-04T09:44:54.1571897Z python3 "${GITHUB_ACTION_PATH}/../../scripts/benchmarks/gather_runners_info.py" 2025-12-04T09:44:54.1578995Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:44:54.1579273Z env: 2025-12-04T09:44:54.1579422Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:44:54.1579614Z HAS_NVIDIA_GPU: true 2025-12-04T09:44:54.1579848Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:44:54.1580248Z DOCKER_CONTAINER_ID: c2ae7847e2aafd5d52f3ce4ca7b43f721da17c1dd22c8b80be2b080cb30eafd7 2025-12-04T09:44:54.1580620Z DEVICE_NAME: 2025-12-04T09:44:54.1580799Z DEVICE_TYPE: 2025-12-04T09:44:54.1580954Z ##[endgroup] 2025-12-04T09:44:54.1609534Z + [[ -n '' ]] 2025-12-04T09:44:54.1610274Z + python3 /home/ec2-user/actions-runner/_work/_actions/pytorch/test-infra/main/.github/actions/upload-benchmark-results/../../scripts/benchmarks/gather_runners_info.py 2025-12-04T09:44:54.9383265Z /home/ec2-user/.local/lib/python3.9/site-packages/torch/_subclasses/functional_tensor.py:276: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:81.) 2025-12-04T09:44:54.9385165Z cpu = _conversion_method_template(device=torch.device("cpu")) 2025-12-04T09:44:55.7430787Z ##[group]Run set -eux 2025-12-04T09:44:55.7430999Z set -eux 2025-12-04T09:44:55.7431166Z  2025-12-04T09:44:55.7431346Z # TODO (huydhn): Implement this part 2025-12-04T09:44:55.7431623Z echo "dependencies={}" >> "${GITHUB_OUTPUT}" 2025-12-04T09:44:55.7440030Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:44:55.7440347Z env: 2025-12-04T09:44:55.7440507Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:44:55.7440706Z HAS_NVIDIA_GPU: true 2025-12-04T09:44:55.7440937Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:44:55.7441351Z DOCKER_CONTAINER_ID: c2ae7847e2aafd5d52f3ce4ca7b43f721da17c1dd22c8b80be2b080cb30eafd7 2025-12-04T09:44:55.7441730Z DEVICE_NAME: 2025-12-04T09:44:55.7441890Z DEVICE_TYPE: 2025-12-04T09:44:55.7442049Z ##[endgroup] 2025-12-04T09:44:55.7469399Z + echo 'dependencies={}' 2025-12-04T09:44:55.7550694Z ##[group]Run set -eux 2025-12-04T09:44:55.7550905Z set -eux 2025-12-04T09:44:55.7551063Z  2025-12-04T09:44:55.7551222Z if [[ -n "" ]]; then 2025-12-04T09:44:55.7551410Z  source "" 2025-12-04T09:44:55.7551584Z fi 2025-12-04T09:44:55.7551730Z  2025-12-04T09:44:55.7551908Z if [[ ! -d "${BENCHMARK_RESULTS_DIR}" ]]; then 2025-12-04T09:44:55.7552230Z  echo "${BENCHMARK_RESULTS_DIR} does not exist, skipping" 2025-12-04T09:44:55.7552592Z  # We don't want the job to fail if the directory doesn't exist 2025-12-04T09:44:55.7552865Z  exit 0 2025-12-04T09:44:55.7553016Z fi 2025-12-04T09:44:55.7553160Z  2025-12-04T09:44:55.7553327Z if [[ "${DRY_RUN}" == "true" ]]; then 2025-12-04T09:44:55.7553778Z  python3 "${GITHUB_ACTION_PATH}/../../scripts/upload_benchmark_results.py" \ 2025-12-04T09:44:55.7554172Z  --benchmark-results-dir "${BENCHMARK_RESULTS_DIR}" \ 2025-12-04T09:44:55.7554471Z  --metadata "${BENCHMARK_METADATA}" \ 2025-12-04T09:44:55.7554714Z  --runners "${RUNNER_INFO}" \ 2025-12-04T09:44:55.7554948Z  --dependencies "${DEPENDENCIES}" \ 2025-12-04T09:44:55.7555173Z  --dry-run 2025-12-04T09:44:55.7555341Z else 2025-12-04T09:44:55.7555628Z  python3 "${GITHUB_ACTION_PATH}/../../scripts/upload_benchmark_results.py" \ 2025-12-04T09:44:55.7556103Z  --benchmark-results-dir "${BENCHMARK_RESULTS_DIR}" \ 2025-12-04T09:44:55.7556400Z  --metadata "${BENCHMARK_METADATA}" \ 2025-12-04T09:44:55.7556637Z  --runners "${RUNNER_INFO}" \ 2025-12-04T09:44:55.7556871Z  --dependencies "${DEPENDENCIES}" 2025-12-04T09:44:55.7557093Z fi 2025-12-04T09:44:55.7564153Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:44:55.7564451Z env: 2025-12-04T09:44:55.7564605Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:44:55.7564797Z HAS_NVIDIA_GPU: true 2025-12-04T09:44:55.7565022Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:44:55.7565426Z DOCKER_CONTAINER_ID: c2ae7847e2aafd5d52f3ce4ca7b43f721da17c1dd22c8b80be2b080cb30eafd7 2025-12-04T09:44:55.7565791Z DEVICE_NAME: 2025-12-04T09:44:55.7565952Z DEVICE_TYPE: 2025-12-04T09:44:55.7566142Z BENCHMARK_RESULTS_DIR: test/test-reports 2025-12-04T09:44:55.7566373Z DRY_RUN: false 2025-12-04T09:44:55.7567457Z BENCHMARK_METADATA: {"timestamp": 1764841494, "schema_version": "v3", "name": "linux-jammy-cuda12.8-py3.10-gcc11 / test (default, 1, 5, lf.linux.g6.4xlarge.experimental.nvidia.gpu, rerun_disabled_tests)", "repo": "pytorch/pytorch", "head_branch": "refs/heads/main", "head_sha": "ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32", "workflow_id": 19922768520, "run_attempt": 1, "job_id": 57116084853} 2025-12-04T09:44:55.7568897Z RUNNER_INFO: [{"cpu_info": "x86_64", "cpu_count": 16, "avail_mem_in_gb": 60, "extra_info": {"hostname": "ip-10-0-34-254.ec2.internal"}, "name": "cuda", "type": "NVIDIA L4", "gpu_count": 1, "avail_gpu_mem_in_gb": 22}] 2025-12-04T09:44:55.7569444Z DEPENDENCIES: {} 2025-12-04T09:44:55.7569600Z ##[endgroup] 2025-12-04T09:44:55.7592805Z + [[ -n '' ]] 2025-12-04T09:44:55.7593014Z + [[ ! -d test/test-reports ]] 2025-12-04T09:44:55.7593225Z + [[ false == \t\r\u\e ]] 2025-12-04T09:44:55.7596371Z + python3 /home/ec2-user/actions-runner/_work/_actions/pytorch/test-infra/main/.github/actions/upload-benchmark-results/../../scripts/upload_benchmark_results.py --benchmark-results-dir test/test-reports --metadata '{"timestamp": 1764841494, "schema_version": "v3", "name": "linux-jammy-cuda12.8-py3.10-gcc11 / test (default, 1, 5, lf.linux.g6.4xlarge.experimental.nvidia.gpu, rerun_disabled_tests)", "repo": "pytorch/pytorch", "head_branch": "refs/heads/main", "head_sha": "ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32", "workflow_id": 19922768520, "run_attempt": 1, "job_id": 57116084853}' --runners '[{"cpu_info": "x86_64", "cpu_count": 16, "avail_mem_in_gb": 60, "extra_info": {"hostname": "ip-10-0-34-254.ec2.internal"}, "name": "cuda", "type": "NVIDIA L4", "gpu_count": 1, "avail_gpu_mem_in_gb": 22}]' --dependencies '{}' 2025-12-04T09:44:55.9032573Z /home/ec2-user/actions-runner/_work/_actions/pytorch/test-infra/main/.github/actions/upload-benchmark-results/../../scripts/upload_benchmark_results.py:236: UserWarning: {'included': [{'test_file': 'lazy/test_ts_opinfo'}], 'excluded': []} from test/test-reports/td_exclusions-42e63d8daa76a5d17144.json is not a benchmark record, skipping 2025-12-04T09:44:55.9034042Z warn(f"{result} from {filepath} is not a benchmark record, skipping") 2025-12-04T09:44:55.9087668Z /home/ec2-user/actions-runner/_work/_actions/pytorch/test-infra/main/.github/actions/upload-benchmark-results/../../scripts/upload_benchmark_results.py:236: UserWarning: {'included': [{'test_file': 'inductor/test_aot_inductor'}, {'test_file': 'dynamo/test_dynamic_shapes'}, {'test_file': 'inductor/test_cudagraph_trees'}, {'test_file': 'dynamo/test_repros'}, {'test_file': 'dynamo/test_misc'}, {'test_file': 'inductor/test_flex_attention'}, {'test_file': 'inductor/test_cuda_select_algorithm'}, {'test_file': 'inductor/test_compile_subprocess'}, {'test_file': 'export/test_retraceability'}, {'test_file': 'dynamo/test_model_output'}, {'test_file': 'inductor/test_extension_backend'}, {'test_file': 'inductor/test_cooperative_reductions'}, {'test_file': 'dynamo/test_fx_graph_runnable'}, {'test_file': 'dynamo/test_backends'}, {'test_file': 'dynamo/test_functions'}, {'test_file': 'inductor/test_mix_order_reduction'}, {'test_file': 'inductor/test_padding'}, {'test_file': 'dynamo/test_dicts'}, {'test_file': 'dynamo/test_aot_compile'}, {'test_file': 'dynamo/test_sets'}, {'test_file': 'dynamo/test_callback'}, {'test_file': 'dynamo/test_fake_distributed'}, {'test_file': 'inductor/test_cudagraph_trees_expandable_segments'}, {'test_file': 'dynamo/test_exceptions'}, {'test_file': 'inductor/test_collective_autotuning'}, {'test_file': 'test_cuda'}, {'test_file': 'test_transformers'}, {'test_file': 'higher_order_ops/test_local_map'}, {'test_file': 'test_dataloader'}, {'test_file': 'test_decomp'}, {'test_file': 'test_ci_sanity_check_fail'}, {'test_file': 'test_cuda_expandable_segments'}, {'test_file': 'test_public_bindings'}, {'test_file': 'test_ops'}, {'test_file': 'functorch/test_dims'}, {'test_file': 'test_torchfuzz_repros'}, {'test_file': 'inductor/test_max_autotune'}, {'test_file': 'doctests'}, {'test_file': 'inductor/test_torchinductor'}, {'test_file': 'inductor/test_select_algorithm'}, {'test_file': 'inductor/test_torchinductor_dynamic_shapes'}, {'test_file': 'inductor/test_torchinductor_codegen_dynamic_shapes'}, {'test_file': 'inductor/test_kernel_benchmark'}, {'test_file': 'inductor/test_torchinductor_opinfo'}, {'test_file': 'inductor/test_pattern_matcher'}, {'test_file': 'inductor/test_group_batch_fusion'}, {'test_file': 'inductor/test_cpu_repro'}, {'test_file': 'inductor/test_cuda_repro'}, {'test_file': 'inductor/test_smoke'}, {'test_file': 'dynamo/test_after_aot'}, {'test_file': 'inductor/test_snode_runtime'}, {'test_file': 'inductor/test_minifier'}, {'test_file': 'inductor/test_compiled_autograd'}, {'test_file': 'inductor/test_custom_lowering'}, {'test_file': 'inductor/test_perf'}, {'test_file': 'inductor/test_fused_attention'}, {'test_file': 'dynamo/test_activation_checkpointing'}, {'test_file': 'inductor/test_binary_folding'}, {'test_file': 'inductor/test_mkldnn_pattern_matcher'}, {'test_file': 'dynamo/test_logging'}, {'test_file': 'inductor/test_inductor_freezing'}, {'test_file': 'inductor/test_layout_optim'}, {'test_file': 'dynamo/test_unspec'}, {'test_file': 'dynamo/test_higher_order_ops'}, {'test_file': 'inductor/test_mmdecomp'}, {'test_file': 'dynamo/test_ctx_manager'}, {'test_file': 'dynamo/test_exc'}, {'test_file': 'inductor/test_cpu_select_algorithm'}, {'test_file': 'inductor/test_aot_inductor_arrayref'}, {'test_file': 'inductor/test_cpu_cpp_wrapper'}, {'test_file': 'inductor/test_cutlass_backend'}, {'test_file': 'inductor/test_triton_cpu_backend'}, {'test_file': 'inductor/test_torchinductor_strided_blocks'}, {'test_file': 'test_custom_ops'}, {'test_file': 'test_content_store'}, {'test_file': 'inductor/test_halide'}, {'test_file': 'inductor/test_flex_decoding'}, {'test_file': 'inductor/test_deterministic'}, {'test_file': 'inductor/test_multi_kernel'}, {'test_file': 'inductor/test_analysis'}, {'test_file': 'inductor/test_pad_mm'}, {'test_file': 'inductor/test_triton_syntax'}, {'test_file': 'inductor/test_triton_extension_backend'}, {'test_file': 'test_sparse_semi_structured'}, {'test_file': 'inductor/test_op_completeness'}, {'test_file': 'inductor/test_subgraph_choice'}, {'test_file': 'inductor/test_b2b_gemm'}, {'test_file': 'inductor/test_triton_heuristics'}, {'test_file': 'inductor/test_cutedsl_grouped_mm'}, {'test_file': 'inductor/test_cpp_wrapper_hipify'}, {'test_file': 'inductor/test_ck_backend'}, {'test_file': 'inductor/test_inductor_utils'}, {'test_file': 'inductor/test_template_heuristics_registry'}, {'test_file': 'inductor/test_async_compile'}, {'test_file': 'inductor/test_gpu_cpp_wrapper'}, {'test_file': 'export/test_export_training_ir_to_run_decomp'}, {'test_file': 'dynamo/test_deque_reconstruct'}, {'test_file': 'inductor/test_utils'}, {'test_file': 'inductor/test_indexing'}, {'test_file': 'inductor/test_inductor_annotations'}, {'test_file': 'inductor/test_compile_worker'}, {'test_file': 'dynamo/test_einops'}, {'test_file': 'inductor/test_external_callables'}, {'test_file': 'test_testing'}, {'test_file': 'dynamo/test_fx_passes_pre_grad'}, {'test_file': 'inductor/test_fp8'}, {'test_file': 'inductor/test_autoheuristic'}, {'test_file': 'export/test_strict_export_v2'}, {'test_file': 'inductor/test_flex_flash'}, {'test_file': 'inductor/test_segmented_tree'}, {'test_file': 'inductor/test_kernel_optimization'}, {'test_file': 'inductor/test_metrics'}, {'test_file': 'export/test_unflatten_training_ir'}, {'test_file': 'inductor/test_triton_kernels'}, {'test_file': 'inductor/test_fx_fusion'}, {'test_file': 'inductor/test_native_matmul'}, {'test_file': 'inductor/test_xpu_basic'}, {'test_file': 'dynamo/test_inline_and_install'}, {'test_file': 'export/test_functionalized_assertions'}, {'test_file': 'inductor/test_loop_ordering'}, {'test_file': 'inductor/test_selective_lowering'}, {'test_file': 'dynamo/test_base_output'}, {'test_file': 'inductor/test_lookup_table'}, {'test_file': 'export/test_serialize'}, {'test_file': 'inductor/test_cutedsl_template'}, {'test_file': 'inductor/test_benchmark_fusion'}, {'test_file': 'inductor/test_inductor_scheduler'}, {'test_file': 'inductor/test_move_constructors_to_gpu'}, {'test_file': 'export/test_export_strict'}, {'test_file': 'dynamo/test_modules'}, {'test_file': 'export/test_serdes'}, {'test_file': 'dynamo/test_regional_inductor'}, {'test_file': 'inductor/test_remote_cache'}, {'test_file': 'inductor/test_coordinate_descent_tuner'}, {'test_file': 'inductor/test_inplace_padding'}, {'test_file': 'inductor/test_cudacodecache'}, {'test_file': 'inductor/test_minifier_utils'}, {'test_file': 'inductor/test_debug_trace'}, {'test_file': 'dynamo/test_recompiles'}, {'test_file': 'inductor/test_foreach'}, {'test_file': 'export/test_tree_utils'}, {'test_file': 'inductor/test_triton_wrapper'}, {'test_file': 'inductor/test_static_cuda_launcher'}, {'test_file': 'export/test_dynamic_shapes'}, {'test_file': 'dynamo/test_sdpa'}, {'test_file': 'inductor/test_aot_inductor_package'}, {'test_file': 'dynamo/test_utils'}, {'test_file': 'inductor/test_provenance_tracing'}, {'test_file': 'inductor/test_combo_kernels'}, {'test_file': 'inductor/test_codegen_triton'}, {'test_file': 'dynamo/test_frame_init'}, {'test_file': 'inductor/test_device_assert'}, {'test_file': 'dynamo/test_skip_non_tensor'}, {'test_file': 'dynamo/test_skip_guard_eval_unsafe'}, {'test_file': 'inductor/test_decompose_mem_bound_mm'}, {'test_file': 'dynamo/test_interop'}, {'test_file': 'functorch/test_eager_transforms'}, {'test_file': 'inductor/test_control_deps'}, {'test_file': 'inductor/test_benchmarking'}, {'test_file': 'inductor/test_helion_kernels'}, {'test_file': 'inductor/test_quantization'}, {'test_file': 'inductor/test_op_dtype_prop'}, {'test_file': 'inductor/test_best_config'}, {'test_file': 'export/test_tools'}, {'test_file': 'inductor/test_compiled_optimizers'}, {'test_file': 'dynamo/test_buffers_override'}, {'test_file': 'inductor/test_inplacing_pass'}, {'test_file': 'inductor/test_aot_inductor_custom_ops'}, {'test_file': 'inductor/test_split_cat_fx_passes'}, {'test_file': 'inductor/test_profiler'}, {'test_file': 'inductor/test_memory_planning'}, {'test_file': 'inductor/test_mem_estimation'}, {'test_file': 'dynamo/test_view'}, {'test_file': 'inductor/test_cutlass_evt'}, {'test_file': 'dynamo/test_reconstruct'}, {'test_file': 'dynamo/test_aot_autograd'}, {'test_file': 'export/test_cpp_serdes'}, {'test_file': 'inductor/test_cache'}, {'test_file': 'inductor/test_block_analysis'}, {'test_file': 'inductor/test_online_softmax'}, {'test_file': 'dynamo/test_subgraphs'}, {'test_file': 'dynamo/test_pre_dispatch'}, {'test_file': 'inductor/test_custom_post_grad_passes'}, {'test_file': 'dynamo/test_fx_annotate'}, {'test_file': 'dynamo/test_pgo'}, {'test_file': 'dynamo/test_config'}, {'test_file': 'dynamo/test_metrics_context'}, {'test_file': 'export/test_package'}, {'test_file': 'export/test_export_opinfo'}, {'test_file': 'dynamo/test_nops'}, {'test_file': 'inductor/test_graph_transform_observer'}, {'test_file': 'inductor/test_memory'}, {'test_file': 'inductor/test_aot_inductor_utils'}, {'test_file': 'export/test_db'}, {'test_file': 'dynamo/test_export_mutations'}, {'test_file': 'inductor/test_config'}, {'test_file': 'inductor/test_dependencies'}, {'test_file': 'inductor/test_fuzzer'}, {'test_file': 'dynamo/test_global'}, {'test_file': 'inductor/test_control_flow'}, {'test_file': 'dynamo/test_graph_region_tracker'}, {'test_file': 'dynamo/test_unittest'}, {'test_file': 'dynamo/test_streams'}, {'test_file': 'inductor/test_compile'}, {'test_file': 'inductor/test_unbacked_symints'}, {'test_file': 'inductor/test_scatter_optimization'}, {'test_file': 'inductor/test_ordered_set'}, {'test_file': 'inductor/test_pallas'}, {'test_file': 'dynamo/test_install_free_tensors'}, {'test_file': 'inductor/test_torchinductor_codegen_config_overrides'}, {'test_file': 'export/test_passes'}, {'test_file': 'dynamo/test_autograd_function'}, {'test_file': 'inductor/test_codecache'}, {'test_file': 'dynamo/test_cudagraphs'}, {'test_file': 'inductor/test_alignment'}, {'test_file': 'dynamo/test_profiler'}, {'test_file': 'dynamo/test_guard_serialization'}, {'test_file': 'dynamo/test_compile'}, {'test_file': 'dynamo/test_nested_graph_breaks'}, {'test_file': 'inductor/test_needs_exact_strides'}, {'test_file': 'inductor/test_auto_functionalize'}, {'test_file': 'inductor/test_split_cat_fx_aten_passes'}, {'test_file': 'inductor/test_minifier_isolate'}, {'test_file': 'dynamo/test_list'}, {'test_file': 'dynamo/test_resume'}, {'test_file': 'inductor/test_augmented_graph_helper'}, {'test_file': 'dynamo/test_deviceguard'}, {'test_file': 'dynamo/test_sources'}, {'test_file': 'dynamo/test_backward_higher_order_ops'}, {'test_file': 'dynamo/test_modes'}, {'test_file': 'dynamo/test_optimizers'}, {'test_file': 'export/test_torchbind'}, {'test_file': 'inductor/test_custom_partitioner_fn'}, {'test_file': 'dynamo/test_debug_utils'}, {'test_file': 'dynamo/test_base_hop'}, {'test_file': 'dynamo/test_export'}, {'test_file': 'dynamo/test_package'}, {'test_file': 'inductor/test_efficient_conv_bn_eval'}, {'test_file': 'inductor/test_torchbind'}, {'test_file': 'dynamo/test_python_dispatcher'}, {'test_file': 'export/test_swap'}, {'test_file': 'export/test_unflatten'}, {'test_file': 'dynamo/test_verify_correctness'}, {'test_file': 'dynamo/test_wrap_inductor_compiled_regions'}, {'test_file': 'inductor/test_fxir_backend'}, {'test_file': 'dynamo/test_cudagraphs_expandable_segments'}, {'test_file': 'inductor/test_caching'}, {'test_file': 'dynamo/test_aot_autograd_cache'}, {'test_file': 'dynamo/test_flat_apply'}, {'test_file': 'dynamo/test_input_attr_tracking'}, {'test_file': 'dynamo/test_compiler_bisector'}, {'test_file': 'dynamo/test_graph_deduplication'}, {'test_file': 'inductor/test_distributed_patterns'}, {'test_file': 'dynamo/test_structured_trace'}, {'test_file': 'dynamo/test_error_messages'}, {'test_file': 'dynamo/test_bytecode_utils'}, {'test_file': 'inductor/test_mps_basic'}, {'test_file': 'export/test_nativert'}, {'test_file': 'inductor/test_custom_op_autotune'}, {'test_file': 'export/test_hop'}, {'test_file': 'dynamo/test_tree_map'}, {'test_file': 'dynamo/test_minifier'}, {'test_file': 'dynamo/test_guard_manager'}, {'test_file': 'export/test_schema'}, {'test_file': 'dynamo/test_torchrec'}, {'test_file': 'export/test_pass_infra'}, {'test_file': 'dynamo/test_recompile_ux'}, {'test_file': 'export/test_experimental'}, {'test_file': 'export/test_converter'}, {'test_file': 'export/test_export'}, {'test_file': 'test_model_exports_to_core_aten'}, {'test_file': 'dynamo/test_precompile_context'}, {'test_file': 'dynamo/test_trace_rules'}, {'test_file': 'export/test_upgrader'}, {'test_file': 'dynamo/test_hooks'}, {'test_file': 'dynamo/test_reorder_logs'}, {'test_file': 'dynamo/test_subclasses'}, {'test_file': 'dynamo/test_generator'}, {'test_file': 'export/test_lift_unlift'}, {'test_file': 'dynamo/test_decorators'}, {'test_file': 'export/test_verifier'}, {'test_file': 'export/test_sparse'}, {'test_file': 'dynamo/test_python_autograd'}, {'test_file': 'export/test_draft_export'}, {'test_file': 'dynamo/test_comptime'}, {'test_file': 'test_sort_and_select'}, {'test_file': 'test_privateuseone_python_backend'}, {'test_file': 'functorch/test_rearrange'}, {'test_file': 'functorch/test_parsing'}, {'test_file': 'test_varlen_attention'}, {'test_file': 'test_package'}, {'test_file': 'profiler/test_profiler'}, {'test_file': 'test_mkl_verbose'}, {'test_file': 'test_comparison_utils'}, {'test_file': 'functorch/test_ac_logging'}, {'test_file': 'test_mkldnn_verbose'}, {'test_file': 'test_cpp_api_parity'}, {'test_file': 'test_utils_config_module'}, {'test_file': 'test_hop_infra'}, {'test_file': 'test_appending_byte_serializer'}, {'test_file': 'test_license'}, {'test_file': 'test_ao_sparsity'}, {'test_file': 'test_autoload'}, {'test_file': 'nn/attention/test_open_registry'}, {'test_file': 'xpu/test_fusion'}, {'test_file': 'test_as_strided'}, {'test_file': 'test_foreach'}, {'test_file': 'test_proxy_tensor'}, {'test_file': 'torch_np/test_binary_ufuncs'}, {'test_file': 'torch_np/test_unary_ufuncs'}, {'test_file': 'test_utils_filelock'}, {'test_file': 'test_extension_utils'}, {'test_file': 'test_rename_privateuse1_to_existing_device'}, {'test_file': 'nn/attention/test_fa4'}, {'test_file': 'typing/test_python_operators'}, {'test_file': 'test_functionalization'}, {'test_file': 'test_matmul_cuda'}, {'test_file': 'torch_np/test_dtype'}, {'test_file': 'test_file_check'}, {'test_file': 'profiler/test_kineto'}, {'test_file': 'test_flop_counter'}, {'test_file': 'backends/xeon/test_launch'}, {'test_file': 'test_show_pickle'}, {'test_file': 'test_openmp'}, {'test_file': 'test_expanded_weights'}, {'test_file': 'test_module_tracker'}, {'test_file': 'test_autograd'}, {'test_file': 'torch_np/numpy_tests/core/test_scalarinherit'}, {'test_file': 'test_tensorexpr_pybind'}, {'test_file': 'test_fx_experimental'}, {'test_file': 'functorch/test_ac_knapsack'}, {'test_file': 'torch_np/test_nep50_examples'}, {'test_file': 'test_torch'}, {'test_file': 'xpu/test_gemm'}, {'test_file': 'test_sparse'}, {'test_file': 'test_fx_passes'}, {'test_file': 'functorch/test_logging'}, {'test_file': 'test_namedtensor'}, {'test_file': 'test_tensorexpr'}, {'test_file': 'functorch/test_minifier'}, {'test_file': 'higher_order_ops/test_invoke_quant'}, {'test_file': 'torch_np/test_basic'}, {'test_file': 'test_jiterator'}, {'test_file': 'test_native_functions'}, {'test_file': 'test_typing'}, {'test_file': 'higher_order_ops/test_with_effects'}, {'test_file': 'higher_order_ops/test_invoke_subgraph'}, {'test_file': 'test_weak'}, {'test_file': 'test_complex'}, {'test_file': 'test_optim'}, {'test_file': 'lazy/test_functionalization'}, {'test_file': 'torch_np/test_random'}, {'test_file': 'nn/test_multihead_attention'}, {'test_file': 'test_legacy_vmap'}, {'test_file': 'lazy/test_bindings'}, {'test_file': 'xpu/test_conv'}, {'test_file': 'test_utils'}, {'test_file': 'test_pytree'}, {'test_file': 'test_namedtuple_return_api'}, {'test_file': 'profiler/test_record_function'}, {'test_file': 'test_compile_benchmark_util'}, {'test_file': 'test_set_default_mobile_cpu_allocator'}, {'test_file': 'test_fake_tensor'}, {'test_file': 'test_stateless'}, {'test_file': 'functorch/test_ac'}, {'test_file': 'test_binary_ufuncs'}, {'test_file': 'higher_order_ops/test_print'}, {'test_file': 'test_per_overload_api'}, {'test_file': 'torch_np/numpy_tests/core/test_einsum'}, {'test_file': 'test_multiprocessing'}, {'test_file': 'test_out_dtype_op'}, {'test_file': 'torch_np/test_ufuncs_basic'}, {'test_file': 'lazy/test_step_closures'}, {'test_file': 'functorch/dim/test_getsetitem'}, {'test_file': 'test_ops_fwd_gradients'}, {'test_file': 'test_meta'}, {'test_file': 'test_ops_jit'}, {'test_file': 'test_fx'}, {'test_file': 'test_numpy_interop'}, {'test_file': 'profiler/test_cpp_thread'}, {'test_file': 'test_ops_gradients'}, {'test_file': 'test_hub'}, {'test_file': 'test_segment_reductions'}, {'test_file': 'test_opaque_obj_v2'}, {'test_file': 'test_autograd_fallback'}, {'test_file': 'test_type_hints'}, {'test_file': 'functorch/test_aot_joint_with_descriptors'}, {'test_file': 'test_nestedtensor'}, {'test_file': 'test_linalg'}, {'test_file': 'test_functionalization_of_rng_ops'}, {'test_file': 'test_fx_reinplace_pass'}, {'test_file': 'functorch/test_control_flow'}, {'test_file': 'test_modules'}, {'test_file': 'nn/test_packed_sequence'}, {'test_file': 'test_numa_binding'}, {'test_file': 'test_pruning_op'}, {'test_file': 'test_jit_fuser_te'}, {'test_file': 'test_autocast'}, {'test_file': 'test_logging'}, {'test_file': 'test_python_dispatch'}, {'test_file': 'nn/test_lazy_modules'}, {'test_file': 'nn/test_pruning'}, {'test_file': 'test_monitor'}, {'test_file': 'test_cuda_sanitizer'}, {'test_file': 'test_bundled_inputs'}, {'test_file': 'torch_np/numpy_tests/core/test_numeric'}, {'test_file': 'torch_np/numpy_tests/core/test_multiarray'}, {'test_file': 'test_itt'}, {'test_file': 'torch_np/numpy_tests/lib/test_function_base'}, {'test_file': 'test_masked'}, {'test_file': 'test_sympy_utils'}, {'test_file': 'test_jit_disabled'}, {'test_file': 'test_subclass'}, {'test_file': 'test_import_stats'}, {'test_file': 'functorch/test_vmap_registrations'}, {'test_file': 'nn/test_parametrization'}, {'test_file': 'test_mkldnn_fusion'}, {'test_file': 'test_cpp_extensions_mtia_backend'}, {'test_file': 'lazy/test_ts_opinfo'}, {'test_file': 'test_dynamic_shapes'}, {'test_file': 'complex_tensor/test_complex_tensor'}, {'test_file': 'optim/test_lrscheduler'}, {'test_file': 'optim/test_swa_utils'}, {'test_file': 'cpp_extensions/python_agnostic_extension/test/test_python_agnostic'}, {'test_file': 'functorch/test_memory_efficient_fusion'}, {'test_file': 'torch_np/numpy_tests/lib/test_histograms'}, {'test_file': 'torch_np/test_indexing'}, {'test_file': 'test_schema_check'}, {'test_file': 'test_tensorboard'}, {'test_file': 'torch_np/numpy_tests/core/test_indexing'}, {'test_file': 'test_futures'}, {'test_file': 'test_tensor_creation_ops'}, {'test_file': 'nn/test_dropout'}, {'test_file': 'functorch/dim/test_split'}, {'test_file': 'torch_np/numpy_tests/lib/test_type_check'}, {'test_file': 'cpp_extensions/test_libtorch_agnostic'}, {'test_file': 'test_cpp_extensions_stream_and_event'}, {'test_file': 'profiler/test_execution_trace'}, {'test_file': 'test_jit'}, {'test_file': 'test_dispatch'}, {'test_file': 'test_datapipe'}, {'test_file': 'test_numba_integration'}, {'test_file': 'test_functional_optim'}, {'test_file': 'test_maskedtensor'}, {'test_file': 'benchmark_utils/test_benchmark_utils'}, {'test_file': 'torch_np/numpy_tests/core/test_scalarmath'}, {'test_file': 'test_scaled_matmul_cuda'}, {'test_file': 'torch_np/numpy_tests/core/test_shape_base'}, {'test_file': 'test_vulkan'}, {'test_file': 'lazy/test_generator'}, {'test_file': 'nn/test_convolution'}, {'test_file': 'torch_np/numpy_tests/linalg/test_linalg'}, {'test_file': 'torch_np/numpy_tests/core/test_dtype'}, {'test_file': 'lazy/test_debug_util'}, {'test_file': 'nn/test_load_state_dict'}, {'test_file': 'test_shape_ops'}, {'test_file': 'nn/test_module_hooks'}, {'test_file': 'torch_np/numpy_tests/lib/test_twodim_base'}, {'test_file': 'profiler/test_memory_profiler'}, {'test_file': 'test_jit_llga_fuser'}, {'test_file': 'test_serialization'}, {'test_file': 'test_sparse_csr'}, {'test_file': 'optim/test_optim'}, {'test_file': 'test_indexing'}, {'test_file': 'torch_np/numpy_tests/fft/test_pocketfft'}, {'test_file': 'functorch/test_ops'}, {'test_file': 'torch_np/numpy_tests/lib/test_shape_base_'}, {'test_file': 'test_cpp_extensions_jit'}, {'test_file': 'torch_np/numpy_tests/core/test_getlimits'}, {'test_file': 'torch_np/test_ndarray_methods'}, {'test_file': 'test_view_ops'}, {'test_file': 'test_type_info'}, {'test_file': 'functorch/test_aotdispatch'}, {'test_file': 'test_nn'}, {'test_file': 'torch_np/numpy_tests/core/test_dlpack'}, {'test_file': 'test_multiprocessing_spawn'}, {'test_file': 'test_scatter_gather_ops'}, {'test_file': 'test_cuda_multigpu'}, {'test_file': 'test_mkldnn'}, {'test_file': 'functorch/test_vmap'}, {'test_file': 'torch_np/numpy_tests/lib/test_index_tricks'}, {'test_file': 'test_jit_autocast'}, {'test_file': 'nn/test_pooling'}, {'test_file': 'nn/test_embedding'}, {'test_file': 'test_unary_ufuncs'}, {'test_file': 'test_xnnpack_integration'}, {'test_file': 'test_cuda_trace'}, {'test_file': 'test_native_mha'}, {'test_file': 'torch_np/numpy_tests/core/test_numerictypes'}, {'test_file': 'test_cuda_nvml_based_avail'}, {'test_file': 'test_function_schema'}, {'test_file': 'test_accelerator'}, {'test_file': 'nn/test_init'}, {'test_file': 'torch_np/numpy_tests/core/test_scalar_methods'}, {'test_file': 'torch_np/numpy_tests/fft/test_helper'}, {'test_file': 'test_mobile_optimizer'}, {'test_file': 'test_overrides'}, {'test_file': 'torch_np/test_function_base'}, {'test_file': 'test_type_promotion'}, {'test_file': 'torch_np/test_scalars_0D_arrays'}, {'test_file': 'test_cuda_primary_ctx'}, {'test_file': 'profiler/test_profiler_tree'}, {'test_file': 'torch_np/numpy_tests/lib/test_arraysetops'}, {'test_file': 'test_dlpack'}, {'test_file': 'profiler/test_torch_tidy'}, {'test_file': 'lazy/test_reuse_ir'}, {'test_file': 'test_functional_autograd_benchmark'}, {'test_file': 'test_reductions'}, {'test_file': 'torch_np/test_reductions'}, {'test_file': 'torch_np/numpy_tests/core/test_scalar_ctors'}, {'test_file': 'torch_np/numpy_tests/lib/test_arraypad'}, {'test_file': 'test_prims'}, {'test_file': 'test_spectral_ops'}, {'test_file': 'profiler/test_python_tracer'}, {'test_file': 'cpp_extensions/libtorch_agnostic_2_10_extension/test_version_compatibility'}, {'test_file': 'distributions/test_distributions'}, {'test_file': 'test_autoload_disable'}, {'test_file': 'test_autoload_enable'}, {'test_file': 'test_cpp_extensions_aot_ninja'}, {'test_file': 'test_cpp_extensions_aot_no_ninja'}], 'excluded': []} from test/test-reports/td_exclusions-09ea0e9477770fe813a8.json is not a benchmark record, skipping 2025-12-04T09:44:55.9138981Z warn(f"{result} from {filepath} is not a benchmark record, skipping") 2025-12-04T09:44:55.9141655Z /home/ec2-user/actions-runner/_work/_actions/pytorch/test-infra/main/.github/actions/upload-benchmark-results/../../scripts/upload_benchmark_results.py:236: UserWarning: {'included': [{'test_file': 'cpp/Dict_test'}, {'test_file': 'cpp/Dimname_test'}, {'test_file': 'cpp/NamedTensor_test'}, {'test_file': 'cpp/apply_utils_test'}, {'test_file': 'cpp/atest'}, {'test_file': 'cpp/basic'}, {'test_file': 'cpp/broadcast_test'}, {'test_file': 'cpp/cpu_generator_test'}, {'test_file': 'cpp/dlconvertor_test'}, {'test_file': 'cpp/extension_backend_test'}, {'test_file': 'cpp/lazy_tensor_test'}, {'test_file': 'cpp/legacy_vmap_test'}, {'test_file': 'cpp/native_test'}, {'test_file': 'cpp/operators_test'}, {'test_file': 'cpp/scalar_tensor_test'}, {'test_file': 'cpp/scalar_test'}, {'test_file': 'cpp/tensor_iterator_test'}, {'test_file': 'cpp/undefined_tensor_test'}, {'test_file': 'cpp/wrapdim_test'}], 'excluded': []} from test/test-reports/td_exclusions-b3249e9941e0d6b0028b.json is not a benchmark record, skipping 2025-12-04T09:44:55.9144212Z warn(f"{result} from {filepath} is not a benchmark record, skipping") 2025-12-04T09:44:55.9145338Z /home/ec2-user/actions-runner/_work/_actions/pytorch/test-infra/main/.github/actions/upload-benchmark-results/../../scripts/upload_benchmark_results.py:236: UserWarning: {'included': [{'test_file': 'cpp/cuda_generator_test'}], 'excluded': []} from test/test-reports/td_exclusions-50bcb7d3b58445670674.json is not a benchmark record, skipping 2025-12-04T09:44:55.9146443Z warn(f"{result} from {filepath} is not a benchmark record, skipping") 2025-12-04T09:44:55.9147535Z /home/ec2-user/actions-runner/_work/_actions/pytorch/test-infra/main/.github/actions/upload-benchmark-results/../../scripts/upload_benchmark_results.py:236: UserWarning: {'included': [{'test_file': 'cpp/cuda_half_test'}], 'excluded': []} from test/test-reports/td_exclusions-b7d4f76f5a4da6a124dd.json is not a benchmark record, skipping 2025-12-04T09:44:55.9148687Z warn(f"{result} from {filepath} is not a benchmark record, skipping") 2025-12-04T09:44:55.9149835Z /home/ec2-user/actions-runner/_work/_actions/pytorch/test-infra/main/.github/actions/upload-benchmark-results/../../scripts/upload_benchmark_results.py:236: UserWarning: {'included': [{'test_file': 'cpp/cuda_vectorized_test'}], 'excluded': []} from test/test-reports/td_exclusions-9de199c44fbd7c8c14c7.json is not a benchmark record, skipping 2025-12-04T09:44:55.9150986Z warn(f"{result} from {filepath} is not a benchmark record, skipping") 2025-12-04T09:44:55.9152107Z /home/ec2-user/actions-runner/_work/_actions/pytorch/test-infra/main/.github/actions/upload-benchmark-results/../../scripts/upload_benchmark_results.py:236: UserWarning: {'included': [{'test_file': 'cpp/cuda_distributions_test'}], 'excluded': []} from test/test-reports/td_exclusions-5438364b0a4f7c0fec4c.json is not a benchmark record, skipping 2025-12-04T09:44:55.9153412Z warn(f"{result} from {filepath} is not a benchmark record, skipping") 2025-12-04T09:44:55.9154514Z /home/ec2-user/actions-runner/_work/_actions/pytorch/test-infra/main/.github/actions/upload-benchmark-results/../../scripts/upload_benchmark_results.py:236: UserWarning: {'included': [{'test_file': 'cpp/cuda_optional_test'}], 'excluded': []} from test/test-reports/td_exclusions-015ceb7458bece78471f.json is not a benchmark record, skipping 2025-12-04T09:44:55.9155614Z warn(f"{result} from {filepath} is not a benchmark record, skipping") 2025-12-04T09:44:55.9156705Z /home/ec2-user/actions-runner/_work/_actions/pytorch/test-infra/main/.github/actions/upload-benchmark-results/../../scripts/upload_benchmark_results.py:236: UserWarning: {'included': [{'test_file': 'cpp/cuda_complex_test'}], 'excluded': []} from test/test-reports/td_exclusions-f2062e6a83fa2f9a7bcf.json is not a benchmark record, skipping 2025-12-04T09:44:55.9157801Z warn(f"{result} from {filepath} is not a benchmark record, skipping") 2025-12-04T09:44:55.9158916Z /home/ec2-user/actions-runner/_work/_actions/pytorch/test-infra/main/.github/actions/upload-benchmark-results/../../scripts/upload_benchmark_results.py:236: UserWarning: {'included': [{'test_file': 'cpp/cuda_complex_math_test'}], 'excluded': []} from test/test-reports/td_exclusions-fa84f3c9f4f630c6109f.json is not a benchmark record, skipping 2025-12-04T09:44:55.9160015Z warn(f"{result} from {filepath} is not a benchmark record, skipping") 2025-12-04T09:44:55.9161222Z /home/ec2-user/actions-runner/_work/_actions/pytorch/test-infra/main/.github/actions/upload-benchmark-results/../../scripts/upload_benchmark_results.py:236: UserWarning: {'included': [{'test_file': 'cpp/cuda_cub_test'}], 'excluded': []} from test/test-reports/td_exclusions-91efe84b623ff402f7e1.json is not a benchmark record, skipping 2025-12-04T09:44:55.9162455Z warn(f"{result} from {filepath} is not a benchmark record, skipping") 2025-12-04T09:44:55.9163561Z /home/ec2-user/actions-runner/_work/_actions/pytorch/test-infra/main/.github/actions/upload-benchmark-results/../../scripts/upload_benchmark_results.py:236: UserWarning: {'included': [{'test_file': 'cpp/cuda_atomic_ops_test'}], 'excluded': []} from test/test-reports/td_exclusions-af98daaffe2ee07e94c8.json is not a benchmark record, skipping 2025-12-04T09:44:55.9164659Z warn(f"{result} from {filepath} is not a benchmark record, skipping") 2025-12-04T09:44:55.9165765Z /home/ec2-user/actions-runner/_work/_actions/pytorch/test-infra/main/.github/actions/upload-benchmark-results/../../scripts/upload_benchmark_results.py:236: UserWarning: {'included': [{'test_file': 'cpp/cuda_allocator_test'}], 'excluded': []} from test/test-reports/td_exclusions-508dee96a71dddf958fe.json is not a benchmark record, skipping 2025-12-04T09:44:55.9166856Z warn(f"{result} from {filepath} is not a benchmark record, skipping") 2025-12-04T09:44:55.9168008Z /home/ec2-user/actions-runner/_work/_actions/pytorch/test-infra/main/.github/actions/upload-benchmark-results/../../scripts/upload_benchmark_results.py:236: UserWarning: {'included': [{'test_file': 'cpp/test_api'}], 'excluded': []} from test/test-reports/td_exclusions-09282199ca93334abad2.json is not a benchmark record, skipping 2025-12-04T09:44:55.9169066Z warn(f"{result} from {filepath} is not a benchmark record, skipping") 2025-12-04T09:44:55.9231773Z ##[group]Run cat test/**/*_toprint.log || true 2025-12-04T09:44:55.9232283Z cat test/**/*_toprint.log || true 2025-12-04T09:44:55.9241482Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:44:55.9241872Z env: 2025-12-04T09:44:55.9242037Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:44:55.9242228Z HAS_NVIDIA_GPU: true 2025-12-04T09:44:55.9242457Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:44:55.9242878Z DOCKER_CONTAINER_ID: c2ae7847e2aafd5d52f3ce4ca7b43f721da17c1dd22c8b80be2b080cb30eafd7 2025-12-04T09:44:55.9243253Z DEVICE_NAME: 2025-12-04T09:44:55.9243409Z DEVICE_TYPE: 2025-12-04T09:44:55.9243556Z ##[endgroup] 2025-12-04T09:44:55.9333736Z Test results will be stored in test-reports/python-pytest/test_ci_sanity_check_fail/test_ci_sanity_check_fail-05128a5df70d9533.xml 2025-12-04T09:44:55.9334403Z ============================= test session starts ============================== 2025-12-04T09:44:55.9334937Z platform linux -- Python 3.10.14, pytest-7.3.2, pluggy-1.6.0 -- /opt/conda/envs/py_3.10/bin/python 2025-12-04T09:44:55.9335379Z cachedir: .pytest_cache 2025-12-04T09:44:55.9335921Z hypothesis profile 'pytorch_ci' -> database=None, max_examples=50, derandomize=True, suppress_health_check=[HealthCheck.too_slow] 2025-12-04T09:44:55.9336752Z rootdir: /var/lib/jenkins/workspace 2025-12-04T09:44:55.9337152Z configfile: pytest.ini 2025-12-04T09:44:55.9337700Z plugins: hypothesis-6.56.4, cpp-2.3.0, flakefinder-1.1.0, rerunfailures-14.0, subtests-0.13.1, xdist-3.3.1, xdoctest-1.3.0, typeguard-4.3.0 2025-12-04T09:44:55.9338302Z collecting ... collected 2 items 2025-12-04T09:44:55.9338614Z stepcurrent: Cannot find last run test, not skipping 2025-12-04T09:44:55.9338946Z Running 0 items in this shard: 2025-12-04T09:44:55.9339118Z 2025-12-04T09:44:55.9339649Z - generated xml file: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_ci_sanity_check_fail/test_ci_sanity_check_fail-05128a5df70d9533.xml - 2025-12-04T09:44:55.9340403Z ============================ no tests ran in 0.01s ============================= 2025-12-04T09:44:55.9431380Z ##[group]Run kill "$MONITOR_SCRIPT_PID" 2025-12-04T09:44:55.9431682Z kill "$MONITOR_SCRIPT_PID" 2025-12-04T09:44:55.9438438Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:44:55.9438712Z env: 2025-12-04T09:44:55.9438864Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:44:55.9439052Z HAS_NVIDIA_GPU: true 2025-12-04T09:44:55.9439271Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:44:55.9439666Z DOCKER_CONTAINER_ID: c2ae7847e2aafd5d52f3ce4ca7b43f721da17c1dd22c8b80be2b080cb30eafd7 2025-12-04T09:44:55.9440045Z DEVICE_NAME: 2025-12-04T09:44:55.9440208Z DEVICE_TYPE: 2025-12-04T09:44:55.9440373Z MONITOR_SCRIPT_PID: 58880 2025-12-04T09:44:55.9440562Z ##[endgroup] 2025-12-04T09:44:55.9468606Z /home/ec2-user/actions-runner/_work/_temp/04c0cf82-4444-4002-a3a4-2a7eb470a11c.sh: line 1: kill: (58880) - No such process 2025-12-04T09:44:55.9478541Z ##[error]Process completed with exit code 1. 2025-12-04T09:44:55.9579970Z Prepare all required actions 2025-12-04T09:44:55.9580326Z Getting action download info 2025-12-04T09:44:56.2525240Z Download action repository 'seemethere/upload-artifact-s3@v5' (SHA:baba72d0712b404f646cebe0730933554ebce96a) 2025-12-04T09:44:56.9156886Z Download action repository 'actions/upload-artifact@v4' (SHA:ea165f8d65b6e75b540449e92b4886f43607fa02) 2025-12-04T09:44:58.8608509Z ##[group]Run ./.github/actions/upload-test-artifacts 2025-12-04T09:44:58.8608773Z with: 2025-12-04T09:44:58.8609206Z file-suffix: test-default-1-5-lf.linux.g6.4xlarge.experimental.nvidia.gpu_57116084853 2025-12-04T09:44:58.8609583Z s3-bucket: gha-artifacts 2025-12-04T09:44:58.8609762Z env: 2025-12-04T09:44:58.8609911Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:44:58.8610092Z HAS_NVIDIA_GPU: true 2025-12-04T09:44:58.8610318Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:44:58.8610719Z DOCKER_CONTAINER_ID: c2ae7847e2aafd5d52f3ce4ca7b43f721da17c1dd22c8b80be2b080cb30eafd7 2025-12-04T09:44:58.8611107Z DEVICE_NAME: 2025-12-04T09:44:58.8611264Z DEVICE_TYPE: 2025-12-04T09:44:58.8611419Z ##[endgroup] 2025-12-04T09:44:58.8704165Z ##[group]Run # Remove any previous test jsons if they exist 2025-12-04T09:44:58.8704577Z # Remove any previous test jsons if they exist 2025-12-04T09:44:58.8704904Z rm -f test-jsons-*.zip 2025-12-04T09:44:58.8705276Z zip -r "test-jsons-${FILE_SUFFIX}.zip" test/test-reports -i '*.json' 2025-12-04T09:44:58.8713188Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:44:58.8713464Z env: 2025-12-04T09:44:58.8713621Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:44:58.8713809Z HAS_NVIDIA_GPU: true 2025-12-04T09:44:58.8714035Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:44:58.8714448Z DOCKER_CONTAINER_ID: c2ae7847e2aafd5d52f3ce4ca7b43f721da17c1dd22c8b80be2b080cb30eafd7 2025-12-04T09:44:58.8714809Z DEVICE_NAME: 2025-12-04T09:44:58.8714964Z DEVICE_TYPE: 2025-12-04T09:44:58.8715291Z FILE_SUFFIX: test-default-1-5-lf.linux.g6.4xlarge.experimental.nvidia.gpu_57116084853 2025-12-04T09:44:58.8715654Z ##[endgroup] 2025-12-04T09:44:58.9581515Z adding: test/test-reports/td_exclusions-42e63d8daa76a5d17144.json (deflated 16%) 2025-12-04T09:44:58.9582314Z adding: test/test-reports/python-pytest/lazy.test_ts_opinfo/lazy.test_ts_opinfo-94fdd26a21107d16.json (stored 0%) 2025-12-04T09:44:58.9583187Z adding: test/test-reports/python-pytest/lazy.test_ts_opinfo/lazy.test_ts_opinfo-4a097beb92ace568.json (stored 0%) 2025-12-04T09:44:59.0417615Z adding: test/test-reports/python-pytest/inductor.test_flex_attention/inductor.test_flex_attention-8a3fc0e970c9a98f.json (deflated 98%) 2025-12-04T09:44:59.0418630Z adding: test/test-reports/python-pytest/inductor.test_flex_attention/inductor.test_flex_attention-51eef588a0930662.json (stored 0%) 2025-12-04T09:44:59.0419613Z adding: test/test-reports/python-pytest/inductor.test_flex_attention/inductor.test_flex_attention-af36cb9c6428bfea.json (stored 0%) 2025-12-04T09:44:59.0420552Z adding: test/test-reports/python-pytest/inductor.test_flex_attention/inductor.test_flex_attention-37c6f4c91cf1b10c.json (stored 0%) 2025-12-04T09:44:59.0421502Z adding: test/test-reports/python-pytest/inductor.test_flex_attention/inductor.test_flex_attention-0f855b036b683cf8.json (stored 0%) 2025-12-04T09:44:59.0451150Z adding: test/test-reports/python-pytest/inductor.test_flex_attention/inductor.test_flex_attention-fba11c84b84fb152.json (deflated 99%) 2025-12-04T09:44:59.0452122Z adding: test/test-reports/python-pytest/dynamo.test_fake_distributed/dynamo.test_fake_distributed-23dd0c70b083b9a3.json (stored 0%) 2025-12-04T09:44:59.0453131Z adding: test/test-reports/python-pytest/test_ci_sanity_check_fail/test_ci_sanity_check_fail-05128a5df70d9533.json (stored 0%) 2025-12-04T09:44:59.0454232Z adding: test/test-reports/python-pytest/inductor.test_benchmark_fusion/inductor.test_benchmark_fusion-e609d0354be0d1fa.json (deflated 99%) 2025-12-04T09:44:59.0455565Z adding: test/test-reports/python-pytest/inductor.test_distributed_patterns/inductor.test_distributed_patterns-ec95bb2c31d278a1.json (stored 0%) 2025-12-04T09:44:59.0456506Z adding: test/test-reports/python-pytest/test_sort_and_select/test_sort_and_select-7d6d7bedd825d12a.json (stored 0%) 2025-12-04T09:44:59.0457260Z adding: test/test-reports/python-pytest/test_privateuseone_python_backend/test_privateuseone_python_backend-5b1698bc2f192c0d.json (stored 0%) 2025-12-04T09:44:59.0457985Z adding: test/test-reports/python-pytest/test_cpp_api_parity/test_cpp_api_parity-410fc5b37a9a9af7.json (stored 0%) 2025-12-04T09:44:59.0458708Z adding: test/test-reports/python-pytest/test_extension_utils/test_extension_utils-8ecaa1c21c0294f7.json (stored 0%) 2025-12-04T09:44:59.0459339Z adding: test/test-reports/python-pytest/test_show_pickle/test_show_pickle-f61e02e14ce79045.json (stored 0%) 2025-12-04T09:44:59.0459927Z adding: test/test-reports/python-pytest/test_torch/test_torch-86cbf7756d7c226b.json (deflated 98%) 2025-12-04T09:44:59.0460521Z adding: test/test-reports/python-pytest/test_tensorexpr/test_tensorexpr-d7faab3c9028401e.json (stored 0%) 2025-12-04T09:44:59.0462192Z adding: test/test-reports/python-pytest/test_utils/test_utils-d587f66af7496981.json (deflated 99%) 2025-12-04T09:44:59.0462854Z adding: test/test-reports/python-pytest/test_namedtuple_return_api/test_namedtuple_return_api-bfbca5bc9e064895.json (stored 0%) 2025-12-04T09:44:59.0463556Z adding: test/test-reports/python-pytest/test_fake_tensor/test_fake_tensor-1a668b992cb70372.json (deflated 95%) 2025-12-04T09:44:59.0466848Z adding: test/test-reports/python-pytest/test_multiprocessing/test_multiprocessing-d56d6df4e289f373.json (deflated 98%) 2025-12-04T09:44:59.0485042Z adding: test/test-reports/python-pytest/test_fx/test_fx-75f810e34c21fdae.json (deflated 99%) 2025-12-04T09:44:59.0485824Z adding: test/test-reports/python-pytest/test_autograd_fallback/test_autograd_fallback-af9a7a31daae2066.json (stored 0%) 2025-12-04T09:44:59.0486626Z adding: test/test-reports/python-pytest/test_autocast/test_autocast-6368d686b9d0a76a.json (stored 0%) 2025-12-04T09:44:59.0487402Z adding: test/test-reports/python-pytest/test_python_dispatch/test_python_dispatch-a00658362550dfa9.json (stored 0%) 2025-12-04T09:44:59.0504504Z adding: test/test-reports/python-pytest/test_jit_disabled/test_jit_disabled-577956ec3f5f835e.json (stored 0%) 2025-12-04T09:44:59.0505281Z adding: test/test-reports/python-pytest/test_cpp_extensions_mtia_backend/test_cpp_extensions_mtia_backend-5e73672e9ac01a41.json (stored 0%) 2025-12-04T09:44:59.0506178Z adding: test/test-reports/python-pytest/functorch.test_memory_efficient_fusion/functorch.test_memory_efficient_fusion-33238d8e909cf2e4.json (stored 0%) 2025-12-04T09:44:59.0506972Z adding: test/test-reports/python-pytest/test_tensor_creation_ops/test_tensor_creation_ops-7daf652f1a7b5488.json (stored 0%) 2025-12-04T09:44:59.0507748Z adding: test/test-reports/python-pytest/test_cpp_extensions_stream_and_event/test_cpp_extensions_stream_and_event-08f3a0dfb2065ef1.json (stored 0%) 2025-12-04T09:44:59.0508455Z adding: test/test-reports/python-pytest/test_dispatch/test_dispatch-0e2e8df624d84805.json (stored 0%) 2025-12-04T09:44:59.0509094Z adding: test/test-reports/python-pytest/nn.test_convolution/nn.test_convolution-80a8ed73694cb0b2.json (deflated 98%) 2025-12-04T09:44:59.0509762Z adding: test/test-reports/python-pytest/test_cpp_extensions_jit/test_cpp_extensions_jit-9b4bc2024c539956.json (stored 0%) 2025-12-04T09:44:59.0510365Z adding: test/test-reports/python-pytest/test_nn/test_nn-bad95e310b75086e.json (deflated 99%) 2025-12-04T09:44:59.0510992Z adding: test/test-reports/python-pytest/test_multiprocessing_spawn/test_multiprocessing_spawn-10678c829390d154.json (stored 0%) 2025-12-04T09:44:59.0511655Z adding: test/test-reports/python-pytest/nn.test_pooling/nn.test_pooling-81f3ed8b48e81d4f.json (stored 0%) 2025-12-04T09:44:59.0512446Z adding: test/test-reports/python-pytest/test_native_mha/test_native_mha-93061de1bb56c8a8.json (stored 0%) 2025-12-04T09:44:59.0513110Z adding: test/test-reports/python-pytest/test_mobile_optimizer/test_mobile_optimizer-f5ddc3c2ec167b4c.json (deflated 99%) 2025-12-04T09:44:59.0513763Z adding: test/test-reports/python-pytest/test_overrides/test_overrides-52ce9f408887fb01.json (deflated 99%) 2025-12-04T09:44:59.0514358Z adding: test/test-reports/python-pytest/test_reductions/test_reductions-1dbe12b386b7a769.json (stored 0%) 2025-12-04T09:44:59.0514955Z adding: test/test-reports/python-pytest/test_spectral_ops/test_spectral_ops-864e0cd8d7f784b9.json (stored 0%) 2025-12-04T09:44:59.0515760Z adding: test/test-reports/python-pytest/distributions.test_distributions/distributions.test_distributions-26c49ab0bf9a2a75.json (deflated 98%) 2025-12-04T09:44:59.0516569Z adding: test/test-reports/python-pytest/test_cpp_extensions_aot_ninja/test_cpp_extensions_aot_ninja-da282566f73e8ede.json (stored 0%) 2025-12-04T09:44:59.0517320Z adding: test/test-reports/python-pytest/test_cpp_extensions_aot_no_ninja/test_cpp_extensions_aot_no_ninja-6de3ce07151778a3.json (stored 0%) 2025-12-04T09:44:59.0518115Z adding: test/test-reports/python-pytest/inductor.test_collective_autotuning/inductor.test_collective_autotuning-e098322b0776271e.json (stored 0%) 2025-12-04T09:44:59.0519030Z adding: test/test-reports/python-pytest/inductor.test_collective_autotuning/inductor.test_collective_autotuning-51799f2bc7fe1d10.json (stored 0%) 2025-12-04T09:44:59.0519842Z adding: test/test-reports/python-pytest/inductor.test_autoheuristic/inductor.test_autoheuristic-886885d891812a32.json (stored 0%) 2025-12-04T09:44:59.0520785Z adding: test/test-reports/python-pytest/inductor.test_autoheuristic/inductor.test_autoheuristic-bad78e46bec01c6d.json (stored 0%) 2025-12-04T09:44:59.0521525Z adding: test/test-reports/python-pytest/inductor.test_control_flow/inductor.test_control_flow-949c20f2734be55d.json (stored 0%) 2025-12-04T09:44:59.0522251Z adding: test/test-reports/python-pytest/inductor.test_control_flow/inductor.test_control_flow-fb44f09fc737f7e3.json (stored 0%) 2025-12-04T09:44:59.0522943Z adding: test/test-reports/python-pytest/test_ops_fwd_gradients/test_ops_fwd_gradients-eb1d57437e276b50.json (stored 0%) 2025-12-04T09:44:59.0523605Z adding: test/test-reports/python-pytest/test_ops_fwd_gradients/test_ops_fwd_gradients-abc350caee946871.json (stored 0%) 2025-12-04T09:44:59.0524188Z adding: test/test-reports/python-pytest/test_meta/test_meta-7177ed662acb32c0.json (stored 0%) 2025-12-04T09:44:59.0524703Z adding: test/test-reports/python-pytest/test_meta/test_meta-183448c70bcf4d3b.json (stored 0%) 2025-12-04T09:44:59.0525214Z adding: test/test-reports/python-pytest/test_meta/test_meta-e4256c1c7d578c86.json (stored 0%) 2025-12-04T09:44:59.0525747Z adding: test/test-reports/python-pytest/test_meta/test_meta-8c2b92910b7a7bdc.json (stored 0%) 2025-12-04T09:44:59.0526376Z adding: test/test-reports/python-pytest/functorch.test_control_flow/functorch.test_control_flow-61cf112623d40fd3.json (stored 0%) 2025-12-04T09:44:59.0527108Z adding: test/test-reports/python-pytest/functorch.test_control_flow/functorch.test_control_flow-510382c7091b27bc.json (stored 0%) 2025-12-04T09:44:59.0527834Z adding: test/test-reports/python-pytest/functorch.test_control_flow/functorch.test_control_flow-4894c67067cce928.json (stored 0%) 2025-12-04T09:44:59.0528555Z adding: test/test-reports/python-pytest/functorch.test_control_flow/functorch.test_control_flow-d3ab6b26454b86ba.json (stored 0%) 2025-12-04T09:44:59.0529198Z adding: test/test-reports/python-pytest/test_modules/test_modules-3b26da1a0a5fa82d.json (stored 0%) 2025-12-04T09:44:59.0529760Z adding: test/test-reports/python-pytest/test_modules/test_modules-2f9f220a5fc72a18.json (stored 0%) 2025-12-04T09:44:59.0530334Z adding: test/test-reports/python-pytest/test_sparse_csr/test_sparse_csr-360baa78a773af74.json (stored 0%) 2025-12-04T09:44:59.0530920Z adding: test/test-reports/python-pytest/test_sparse_csr/test_sparse_csr-db92a6b1a0084256.json (stored 0%) 2025-12-04T09:44:59.0531631Z adding: test/test-reports/python-pytest/functorch.test_ops/functorch.test_ops-ab58c0af20190c71.json (stored 0%) 2025-12-04T09:44:59.0532265Z adding: test/test-reports/python-pytest/functorch.test_ops/functorch.test_ops-3d8b763c339d8b21.json (stored 0%) 2025-12-04T09:44:59.0532972Z adding: test/test-reports/python-pytest/functorch.test_ops/functorch.test_ops-77e738022a2c01ca.json (stored 0%) 2025-12-04T09:44:59.0533591Z adding: test/test-reports/python-pytest/functorch.test_ops/functorch.test_ops-8fe434a401eb0404.json (stored 0%) 2025-12-04T09:44:59.0534167Z adding: test/test-reports/td_exclusions-09ea0e9477770fe813a8.json (deflated 82%) 2025-12-04T09:44:59.0534778Z adding: test/test-reports/python-unittest/test_autoload/TEST-TestDeviceBackendAutoload-20251204092927.json (deflated 37%) 2025-12-04T09:44:59.0535516Z adding: test/test-reports/python-unittest/test_autoload/TEST-TestDeviceBackendAutoload-20251204093047.json (deflated 37%) 2025-12-04T09:44:59.0536114Z adding: test/test-reports/td_exclusions-b3249e9941e0d6b0028b.json (deflated 73%) 2025-12-04T09:44:59.0536547Z adding: test/test-reports/td_exclusions-50bcb7d3b58445670674.json (deflated 14%) 2025-12-04T09:44:59.0537021Z adding: test/test-reports/td_exclusions-b7d4f76f5a4da6a124dd.json (deflated 15%) 2025-12-04T09:44:59.0537462Z adding: test/test-reports/td_exclusions-9de199c44fbd7c8c14c7.json (deflated 14%) 2025-12-04T09:44:59.0537913Z adding: test/test-reports/td_exclusions-5438364b0a4f7c0fec4c.json (deflated 13%) 2025-12-04T09:44:59.0538358Z adding: test/test-reports/td_exclusions-015ceb7458bece78471f.json (deflated 14%) 2025-12-04T09:44:59.0538799Z adding: test/test-reports/td_exclusions-f2062e6a83fa2f9a7bcf.json (deflated 14%) 2025-12-04T09:44:59.0539234Z adding: test/test-reports/td_exclusions-fa84f3c9f4f630c6109f.json (deflated 13%) 2025-12-04T09:44:59.0539662Z adding: test/test-reports/td_exclusions-91efe84b623ff402f7e1.json (deflated 15%) 2025-12-04T09:44:59.0540100Z adding: test/test-reports/td_exclusions-af98daaffe2ee07e94c8.json (deflated 14%) 2025-12-04T09:44:59.0540537Z adding: test/test-reports/td_exclusions-508dee96a71dddf958fe.json (deflated 14%) 2025-12-04T09:44:59.0540968Z adding: test/test-reports/td_exclusions-09282199ca93334abad2.json (deflated 18%) 2025-12-04T09:44:59.0575249Z ##[group]Run # Remove any previous test reports if they exist 2025-12-04T09:44:59.0575666Z # Remove any previous test reports if they exist 2025-12-04T09:44:59.0575995Z rm -f test-reports-*.zip 2025-12-04T09:44:59.0576372Z zip -r "test-reports-${FILE_SUFFIX}.zip" test/test-reports -i '*.xml' -i '*.csv' 2025-12-04T09:44:59.0583399Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:44:59.0583669Z env: 2025-12-04T09:44:59.0583829Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:44:59.0584022Z HAS_NVIDIA_GPU: true 2025-12-04T09:44:59.0584242Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:44:59.0584657Z DOCKER_CONTAINER_ID: c2ae7847e2aafd5d52f3ce4ca7b43f721da17c1dd22c8b80be2b080cb30eafd7 2025-12-04T09:44:59.0585017Z DEVICE_NAME: 2025-12-04T09:44:59.0585178Z DEVICE_TYPE: 2025-12-04T09:44:59.0585480Z FILE_SUFFIX: test-default-1-5-lf.linux.g6.4xlarge.experimental.nvidia.gpu_57116084853 2025-12-04T09:44:59.0585859Z ##[endgroup] 2025-12-04T09:44:59.0670323Z adding: test/test-reports/python-pytest/lazy.test_ts_opinfo/lazy.test_ts_opinfo-94fdd26a21107d16.xml (deflated 28%) 2025-12-04T09:44:59.0671214Z adding: test/test-reports/python-pytest/lazy.test_ts_opinfo/lazy.test_ts_opinfo-4a097beb92ace568.xml (deflated 28%) 2025-12-04T09:44:59.1489782Z adding: test/test-reports/python-pytest/inductor.test_flex_attention/inductor.test_flex_attention-8a3fc0e970c9a98f.xml (deflated 98%) 2025-12-04T09:44:59.1490794Z adding: test/test-reports/python-pytest/inductor.test_flex_attention/inductor.test_flex_attention-51eef588a0930662.xml (deflated 28%) 2025-12-04T09:44:59.1491978Z adding: test/test-reports/python-pytest/inductor.test_flex_attention/inductor.test_flex_attention-af36cb9c6428bfea.xml (deflated 28%) 2025-12-04T09:44:59.1493050Z adding: test/test-reports/python-pytest/inductor.test_flex_attention/inductor.test_flex_attention-37c6f4c91cf1b10c.xml (deflated 28%) 2025-12-04T09:44:59.1494011Z adding: test/test-reports/python-pytest/inductor.test_flex_attention/inductor.test_flex_attention-0f855b036b683cf8.xml (deflated 28%) 2025-12-04T09:44:59.1522711Z adding: test/test-reports/python-pytest/inductor.test_flex_attention/inductor.test_flex_attention-fba11c84b84fb152.xml (deflated 99%) 2025-12-04T09:44:59.1523827Z adding: test/test-reports/python-pytest/dynamo.test_fake_distributed/dynamo.test_fake_distributed-23dd0c70b083b9a3.xml (deflated 28%) 2025-12-04T09:44:59.1524777Z adding: test/test-reports/python-pytest/test_ci_sanity_check_fail/test_ci_sanity_check_fail-05128a5df70d9533.xml (deflated 28%) 2025-12-04T09:44:59.1525710Z adding: test/test-reports/python-pytest/inductor.test_benchmark_fusion/inductor.test_benchmark_fusion-e609d0354be0d1fa.xml (deflated 98%) 2025-12-04T09:44:59.1526697Z adding: test/test-reports/python-pytest/inductor.test_distributed_patterns/inductor.test_distributed_patterns-ec95bb2c31d278a1.xml (deflated 28%) 2025-12-04T09:44:59.1527525Z adding: test/test-reports/python-pytest/test_sort_and_select/test_sort_and_select-7d6d7bedd825d12a.xml (deflated 28%) 2025-12-04T09:44:59.1528287Z adding: test/test-reports/python-pytest/test_privateuseone_python_backend/test_privateuseone_python_backend-5b1698bc2f192c0d.xml (deflated 28%) 2025-12-04T09:44:59.1529020Z adding: test/test-reports/python-pytest/test_cpp_api_parity/test_cpp_api_parity-410fc5b37a9a9af7.xml (deflated 28%) 2025-12-04T09:44:59.1529674Z adding: test/test-reports/python-pytest/test_extension_utils/test_extension_utils-8ecaa1c21c0294f7.xml (deflated 28%) 2025-12-04T09:44:59.1530310Z adding: test/test-reports/python-pytest/test_show_pickle/test_show_pickle-f61e02e14ce79045.xml (deflated 28%) 2025-12-04T09:44:59.1530916Z adding: test/test-reports/python-pytest/test_torch/test_torch-86cbf7756d7c226b.xml (deflated 97%) 2025-12-04T09:44:59.1531501Z adding: test/test-reports/python-pytest/test_tensorexpr/test_tensorexpr-d7faab3c9028401e.xml (deflated 28%) 2025-12-04T09:44:59.1532794Z adding: test/test-reports/python-pytest/test_utils/test_utils-d587f66af7496981.xml (deflated 99%) 2025-12-04T09:44:59.1533452Z adding: test/test-reports/python-pytest/test_namedtuple_return_api/test_namedtuple_return_api-bfbca5bc9e064895.xml (deflated 28%) 2025-12-04T09:44:59.1534165Z adding: test/test-reports/python-pytest/test_fake_tensor/test_fake_tensor-1a668b992cb70372.xml (deflated 91%) 2025-12-04T09:44:59.1536865Z adding: test/test-reports/python-pytest/test_multiprocessing/test_multiprocessing-d56d6df4e289f373.xml (deflated 98%) 2025-12-04T09:44:59.1554363Z adding: test/test-reports/python-pytest/test_fx/test_fx-75f810e34c21fdae.xml (deflated 99%) 2025-12-04T09:44:59.1555169Z adding: test/test-reports/python-pytest/test_autograd_fallback/test_autograd_fallback-af9a7a31daae2066.xml (deflated 28%) 2025-12-04T09:44:59.1555983Z adding: test/test-reports/python-pytest/test_autocast/test_autocast-6368d686b9d0a76a.xml (deflated 28%) 2025-12-04T09:44:59.1556760Z adding: test/test-reports/python-pytest/test_python_dispatch/test_python_dispatch-a00658362550dfa9.xml (deflated 28%) 2025-12-04T09:44:59.1557550Z adding: test/test-reports/python-pytest/test_jit_disabled/test_jit_disabled-577956ec3f5f835e.xml (deflated 28%) 2025-12-04T09:44:59.1558420Z adding: test/test-reports/python-pytest/test_cpp_extensions_mtia_backend/test_cpp_extensions_mtia_backend-5e73672e9ac01a41.xml (deflated 28%) 2025-12-04T09:44:59.1559469Z adding: test/test-reports/python-pytest/functorch.test_memory_efficient_fusion/functorch.test_memory_efficient_fusion-33238d8e909cf2e4.xml (deflated 28%) 2025-12-04T09:44:59.1560451Z adding: test/test-reports/python-pytest/test_tensor_creation_ops/test_tensor_creation_ops-7daf652f1a7b5488.xml (deflated 28%) 2025-12-04T09:44:59.1561532Z adding: test/test-reports/python-pytest/test_cpp_extensions_stream_and_event/test_cpp_extensions_stream_and_event-08f3a0dfb2065ef1.xml (deflated 28%) 2025-12-04T09:44:59.1562620Z adding: test/test-reports/python-pytest/test_dispatch/test_dispatch-0e2e8df624d84805.xml (deflated 28%) 2025-12-04T09:44:59.1563409Z adding: test/test-reports/python-pytest/nn.test_convolution/nn.test_convolution-80a8ed73694cb0b2.xml (deflated 98%) 2025-12-04T09:44:59.1564238Z adding: test/test-reports/python-pytest/test_cpp_extensions_jit/test_cpp_extensions_jit-9b4bc2024c539956.xml (deflated 28%) 2025-12-04T09:44:59.1565064Z adding: test/test-reports/python-pytest/test_nn/test_nn-bad95e310b75086e.xml (deflated 99%) 2025-12-04T09:44:59.1565829Z adding: test/test-reports/python-pytest/test_multiprocessing_spawn/test_multiprocessing_spawn-10678c829390d154.xml (deflated 28%) 2025-12-04T09:44:59.1566599Z adding: test/test-reports/python-pytest/nn.test_pooling/nn.test_pooling-81f3ed8b48e81d4f.xml (deflated 29%) 2025-12-04T09:44:59.1567192Z adding: test/test-reports/python-pytest/test_native_mha/test_native_mha-93061de1bb56c8a8.xml (deflated 28%) 2025-12-04T09:44:59.1567827Z adding: test/test-reports/python-pytest/test_mobile_optimizer/test_mobile_optimizer-f5ddc3c2ec167b4c.xml (deflated 99%) 2025-12-04T09:44:59.1570060Z adding: test/test-reports/python-pytest/test_overrides/test_overrides-52ce9f408887fb01.xml (deflated 99%) 2025-12-04T09:44:59.1570802Z adding: test/test-reports/python-pytest/test_reductions/test_reductions-1dbe12b386b7a769.xml (deflated 28%) 2025-12-04T09:44:59.1571443Z adding: test/test-reports/python-pytest/test_spectral_ops/test_spectral_ops-864e0cd8d7f784b9.xml (deflated 28%) 2025-12-04T09:44:59.1572664Z adding: test/test-reports/python-pytest/distributions.test_distributions/distributions.test_distributions-26c49ab0bf9a2a75.xml (deflated 98%) 2025-12-04T09:44:59.1573860Z adding: test/test-reports/python-pytest/test_cpp_extensions_aot_ninja/test_cpp_extensions_aot_ninja-da282566f73e8ede.xml (deflated 28%) 2025-12-04T09:44:59.1574984Z adding: test/test-reports/python-pytest/test_cpp_extensions_aot_no_ninja/test_cpp_extensions_aot_no_ninja-6de3ce07151778a3.xml (deflated 28%) 2025-12-04T09:44:59.1576233Z adding: test/test-reports/python-pytest/inductor.test_collective_autotuning/inductor.test_collective_autotuning-e098322b0776271e.xml (deflated 28%) 2025-12-04T09:44:59.1577542Z adding: test/test-reports/python-pytest/inductor.test_collective_autotuning/inductor.test_collective_autotuning-51799f2bc7fe1d10.xml (deflated 28%) 2025-12-04T09:44:59.1578756Z adding: test/test-reports/python-pytest/inductor.test_autoheuristic/inductor.test_autoheuristic-886885d891812a32.xml (deflated 28%) 2025-12-04T09:44:59.1579946Z adding: test/test-reports/python-pytest/inductor.test_autoheuristic/inductor.test_autoheuristic-bad78e46bec01c6d.xml (deflated 28%) 2025-12-04T09:44:59.1580920Z adding: test/test-reports/python-pytest/inductor.test_control_flow/inductor.test_control_flow-949c20f2734be55d.xml (deflated 28%) 2025-12-04T09:44:59.1581672Z adding: test/test-reports/python-pytest/inductor.test_control_flow/inductor.test_control_flow-fb44f09fc737f7e3.xml (deflated 28%) 2025-12-04T09:44:59.1582391Z adding: test/test-reports/python-pytest/test_ops_fwd_gradients/test_ops_fwd_gradients-eb1d57437e276b50.xml (deflated 28%) 2025-12-04T09:44:59.1583075Z adding: test/test-reports/python-pytest/test_ops_fwd_gradients/test_ops_fwd_gradients-abc350caee946871.xml (deflated 28%) 2025-12-04T09:44:59.1583680Z adding: test/test-reports/python-pytest/test_meta/test_meta-7177ed662acb32c0.xml (deflated 27%) 2025-12-04T09:44:59.1584206Z adding: test/test-reports/python-pytest/test_meta/test_meta-183448c70bcf4d3b.xml (deflated 27%) 2025-12-04T09:44:59.1584728Z adding: test/test-reports/python-pytest/test_meta/test_meta-e4256c1c7d578c86.xml (deflated 27%) 2025-12-04T09:44:59.1585310Z adding: test/test-reports/python-pytest/test_meta/test_meta-8c2b92910b7a7bdc.xml (deflated 28%) 2025-12-04T09:44:59.1586321Z adding: test/test-reports/python-pytest/functorch.test_control_flow/functorch.test_control_flow-61cf112623d40fd3.xml (deflated 28%) 2025-12-04T09:44:59.1587080Z adding: test/test-reports/python-pytest/functorch.test_control_flow/functorch.test_control_flow-510382c7091b27bc.xml (deflated 28%) 2025-12-04T09:44:59.1587828Z adding: test/test-reports/python-pytest/functorch.test_control_flow/functorch.test_control_flow-4894c67067cce928.xml (deflated 28%) 2025-12-04T09:44:59.1588571Z adding: test/test-reports/python-pytest/functorch.test_control_flow/functorch.test_control_flow-d3ab6b26454b86ba.xml (deflated 28%) 2025-12-04T09:44:59.1589271Z adding: test/test-reports/python-pytest/test_modules/test_modules-3b26da1a0a5fa82d.xml (deflated 28%) 2025-12-04T09:44:59.1589833Z adding: test/test-reports/python-pytest/test_modules/test_modules-2f9f220a5fc72a18.xml (deflated 28%) 2025-12-04T09:44:59.1590411Z adding: test/test-reports/python-pytest/test_sparse_csr/test_sparse_csr-360baa78a773af74.xml (deflated 28%) 2025-12-04T09:44:59.1591003Z adding: test/test-reports/python-pytest/test_sparse_csr/test_sparse_csr-db92a6b1a0084256.xml (deflated 28%) 2025-12-04T09:44:59.1591627Z adding: test/test-reports/python-pytest/functorch.test_ops/functorch.test_ops-ab58c0af20190c71.xml (deflated 28%) 2025-12-04T09:44:59.1592303Z adding: test/test-reports/python-pytest/functorch.test_ops/functorch.test_ops-3d8b763c339d8b21.xml (deflated 28%) 2025-12-04T09:44:59.1592938Z adding: test/test-reports/python-pytest/functorch.test_ops/functorch.test_ops-77e738022a2c01ca.xml (deflated 29%) 2025-12-04T09:44:59.1593571Z adding: test/test-reports/python-pytest/functorch.test_ops/functorch.test_ops-8fe434a401eb0404.xml (deflated 28%) 2025-12-04T09:44:59.1594250Z adding: test/test-reports/python-unittest/test_autoload/TEST-TestDeviceBackendAutoload-20251204092927.xml (deflated 42%) 2025-12-04T09:44:59.1594974Z adding: test/test-reports/python-unittest/test_autoload/TEST-TestDeviceBackendAutoload-20251204093047.xml (deflated 42%) 2025-12-04T09:44:59.1621053Z ##[group]Run # Remove any previous usage logs if they exist 2025-12-04T09:44:59.1621375Z # Remove any previous usage logs if they exist 2025-12-04T09:44:59.1621638Z rm -f logs-*.zip 2025-12-04T09:44:59.1621884Z zip "logs-${FILE_SUFFIX}.zip" 'usage_log.txt' || true 2025-12-04T09:44:59.1622239Z zip -r "logs-${FILE_SUFFIX}.zip" test/test-reports -i '*.log' || true 2025-12-04T09:44:59.1629246Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:44:59.1629519Z env: 2025-12-04T09:44:59.1629688Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:44:59.1629871Z HAS_NVIDIA_GPU: true 2025-12-04T09:44:59.1630098Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:44:59.1630501Z DOCKER_CONTAINER_ID: c2ae7847e2aafd5d52f3ce4ca7b43f721da17c1dd22c8b80be2b080cb30eafd7 2025-12-04T09:44:59.1630861Z DEVICE_NAME: 2025-12-04T09:44:59.1631015Z DEVICE_TYPE: 2025-12-04T09:44:59.1631325Z FILE_SUFFIX: test-default-1-5-lf.linux.g6.4xlarge.experimental.nvidia.gpu_57116084853 2025-12-04T09:44:59.1631678Z ##[endgroup] 2025-12-04T09:44:59.1700440Z adding: usage_log.txt (deflated 58%) 2025-12-04T09:44:59.1735574Z adding: test/test-reports/lazy.test_ts_opinfo_1.1_784637acdd02e0b4_.log (deflated 49%) 2025-12-04T09:44:59.1736337Z adding: test/test-reports/lazy.test_ts_opinfo_1.1_7a32b03ee4eba52c_.log (deflated 49%) 2025-12-04T09:44:59.2558172Z adding: test/test-reports/inductor.test_flex_attention_1.6_88265ee86f308f18_.log (deflated 98%) 2025-12-04T09:44:59.2559200Z adding: test/test-reports/inductor.test_flex_attention_2.6_4c127f8c0344b8ae_.log (deflated 50%) 2025-12-04T09:44:59.2560103Z adding: test/test-reports/inductor.test_flex_attention_3.6_df984f4b37c4889f_.log (deflated 50%) 2025-12-04T09:44:59.2560732Z adding: test/test-reports/inductor.test_flex_attention_4.6_453bb167963be69f_.log (deflated 50%) 2025-12-04T09:44:59.2561355Z adding: test/test-reports/inductor.test_flex_attention_5.6_691743a62e38d3bc_.log (deflated 50%) 2025-12-04T09:44:59.2562376Z adding: test/test-reports/inductor.test_flex_attention_6.6_878f54c09ed3a0a3_.log (deflated 96%) 2025-12-04T09:44:59.2563399Z adding: test/test-reports/dynamo.test_fake_distributed_1.1_2823f9f71dd63ef0_.log (deflated 50%) 2025-12-04T09:44:59.2564249Z adding: test/test-reports/test_ci_sanity_check_fail_1.1_umbpx4lh_toprint.log (deflated 50%) 2025-12-04T09:44:59.2564870Z adding: test/test-reports/inductor.test_max_autotune_1.1_6b81222e05fc6886_.log (deflated 35%) 2025-12-04T09:44:59.2565502Z adding: test/test-reports/inductor.test_cutlass_backend_1.1_b129286a599ec374_.log (deflated 33%) 2025-12-04T09:44:59.2566532Z adding: test/test-reports/inductor.test_benchmark_fusion_1.1_21e3a2205906c0e4_.log (deflated 97%) 2025-12-04T09:44:59.2567276Z adding: test/test-reports/inductor.test_distributed_patterns_1.1_37bc78cbf755a342_.log (deflated 51%) 2025-12-04T09:44:59.2567922Z adding: test/test-reports/test_sort_and_select_1.1_be558fc611a376c0_.log (deflated 49%) 2025-12-04T09:44:59.2568664Z adding: test/test-reports/test_privateuseone_python_backend_1.1_67f192d2bb258afb_.log (deflated 51%) 2025-12-04T09:44:59.2569181Z adding: test/test-reports/test_cpp_api_parity_1.1_e38423beb940dabc_.log (deflated 49%) 2025-12-04T09:44:59.2569726Z adding: test/test-reports/test_extension_utils_1.1_6f448e6667adee0f_.log (deflated 49%) 2025-12-04T09:44:59.2570182Z adding: test/test-reports/test_show_pickle_1.1_c585acf62ef1069a_.log (deflated 49%) 2025-12-04T09:44:59.2571900Z adding: test/test-reports/test_torch_1.1_b2685fecb603b132_.log (deflated 96%) 2025-12-04T09:44:59.2572590Z adding: test/test-reports/test_tensorexpr_1.1_8f889bbf725904b3_.log (deflated 48%) 2025-12-04T09:44:59.2574028Z adding: test/test-reports/test_utils_1.1_54bd9f1f48dc35ff_.log (deflated 96%) 2025-12-04T09:44:59.2574642Z adding: test/test-reports/test_namedtuple_return_api_1.1_e16a2c1bff68b907_.log (deflated 50%) 2025-12-04T09:44:59.2575612Z adding: test/test-reports/test_fake_tensor_1.1_6cc5e0685466fadf_.log (deflated 90%) 2025-12-04T09:44:59.2577337Z adding: test/test-reports/test_multiprocessing_1.1_79160357459cb2c2_.log (deflated 95%) 2025-12-04T09:44:59.2578469Z adding: test/test-reports/test_fx_1.1_c23d985f9096efa2_.log (deflated 94%) 2025-12-04T09:44:59.2579095Z adding: test/test-reports/test_autograd_fallback_1.1_d1fd30600f5e8389_.log (deflated 49%) 2025-12-04T09:44:59.2579618Z adding: test/test-reports/test_autocast_1.1_8f25c3d25ce2a5fe_.log (deflated 48%) 2025-12-04T09:44:59.2580085Z adding: test/test-reports/test_python_dispatch_1.1_6d2023546169b768_.log (deflated 49%) 2025-12-04T09:44:59.2580572Z adding: test/test-reports/test_jit_disabled_1.1_f4341100659034b7_.log (deflated 48%) 2025-12-04T09:44:59.2581058Z adding: test/test-reports/test_cpp_extensions_mtia_backend_1.1_592f44b55155dce1_.log (deflated 51%) 2025-12-04T09:44:59.2581620Z adding: test/test-reports/functorch.test_memory_efficient_fusion_1.1_cd53c92f6cd6db14_.log (deflated 51%) 2025-12-04T09:44:59.2582154Z adding: test/test-reports/test_tensor_creation_ops_1.1_cdc54e89ec4eb471_.log (deflated 50%) 2025-12-04T09:44:59.2582679Z adding: test/test-reports/test_cpp_extensions_stream_and_event_1.1_71f2127b80ebb7e0_.log (deflated 51%) 2025-12-04T09:44:59.2583172Z adding: test/test-reports/test_dispatch_1.1_618c0ce137115f87_.log (deflated 48%) 2025-12-04T09:44:59.2583715Z adding: test/test-reports/nn.test_convolution_1.1_6bed80e83e42a622_.log (deflated 95%) 2025-12-04T09:44:59.2584184Z adding: test/test-reports/test_cpp_extensions_jit_1.1_7536298b4447f40e_.log (deflated 50%) 2025-12-04T09:44:59.2586653Z adding: test/test-reports/test_nn_1.1_23aeac1c3d14f337_.log (deflated 97%) 2025-12-04T09:44:59.2587141Z adding: test/test-reports/test_multiprocessing_spawn_1.1_7b3cc8d6efb8b7a0_.log (deflated 50%) 2025-12-04T09:44:59.2587631Z adding: test/test-reports/nn.test_pooling_1.1_f8fdb3392a1afe99_.log (deflated 48%) 2025-12-04T09:44:59.2588058Z adding: test/test-reports/test_cuda_trace_1.1_dd993c3a642ff88e_.log (stored 0%) 2025-12-04T09:44:59.2588608Z adding: test/test-reports/test_native_mha_1.1_da5e8954db1fb20b_.log (deflated 48%) 2025-12-04T09:44:59.2589073Z adding: test/test-reports/test_cuda_nvml_based_avail_1.1_b3fe187c64b13f60_.log (stored 0%) 2025-12-04T09:44:59.2590549Z adding: test/test-reports/test_mobile_optimizer_1.1_6bfbdcc4fadadd51_.log (deflated 97%) 2025-12-04T09:44:59.2593797Z adding: test/test-reports/test_overrides_1.1_cccd5c265cf6e4d2_.log (deflated 98%) 2025-12-04T09:44:59.2594249Z adding: test/test-reports/test_cuda_primary_ctx_1.1_41fde96bae32b26e_.log (stored 0%) 2025-12-04T09:44:59.2594779Z adding: test/test-reports/test_reductions_1.1_9ab6c8b8ff1ac059_.log (deflated 48%) 2025-12-04T09:44:59.2595224Z adding: test/test-reports/test_spectral_ops_1.1_1bcbe1253863dd88_.log (deflated 48%) 2025-12-04T09:44:59.2596069Z adding: test/test-reports/distributions.test_distributions_1.1_5253b86f7af9d941_.log (deflated 95%) 2025-12-04T09:44:59.2596618Z adding: test/test-reports/test_cpp_extensions_aot_ninja_1.1_56e7b64c9caaf566_.log (deflated 50%) 2025-12-04T09:44:59.2597141Z adding: test/test-reports/test_cpp_extensions_aot_no_ninja_1.1_7f0ee8713838d52a_.log (deflated 51%) 2025-12-04T09:44:59.2597731Z adding: test/test-reports/inductor.test_collective_autotuning_1.1_b691007e7842981d_.log (deflated 52%) 2025-12-04T09:44:59.2598261Z adding: test/test-reports/inductor.test_cpu_cpp_wrapper_1.1_a9797ad9104e521b_.log (deflated 6%) 2025-12-04T09:44:59.2598767Z adding: test/test-reports/inductor.test_autoheuristic_1.1_56a71428e4aab3c0_.log (deflated 50%) 2025-12-04T09:44:59.2599284Z adding: test/test-reports/inductor.test_control_flow_5.5_440a0d8507c2b373_.log (deflated 50%) 2025-12-04T09:44:59.2599798Z adding: test/test-reports/inductor.test_mps_basic_1.1_eecb82439a0d6bc2_.log (deflated 6%) 2025-12-04T09:44:59.2600270Z adding: test/test-reports/test_ops_fwd_gradients_3.3_a7026342df501131_.log (deflated 50%) 2025-12-04T09:44:59.2600702Z adding: test/test-reports/test_meta_1.4_9f7909e7b7fd2ce6_.log (deflated 49%) 2025-12-04T09:44:59.2601111Z adding: test/test-reports/test_meta_2.4_db9a1888eef284c9_.log (deflated 49%) 2025-12-04T09:44:59.2601580Z adding: test/test-reports/functorch.test_control_flow_2.4_6e5eff350cb47256_.log (deflated 50%) 2025-12-04T09:44:59.2602086Z adding: test/test-reports/functorch.test_control_flow_3.4_d5eb7fcdd6d69766_.log (deflated 50%) 2025-12-04T09:44:59.2602544Z adding: test/test-reports/test_modules_1.2_2228adb73596c4fc_.log (deflated 48%) 2025-12-04T09:44:59.2602991Z adding: test/test-reports/optim.test_lrscheduler_1.1_3074b4c960c21b1e_.log (deflated 7%) 2025-12-04T09:44:59.2603446Z adding: test/test-reports/test_sparse_csr_3.3_ff7431816ae40a69_.log (deflated 49%) 2025-12-04T09:44:59.2603895Z adding: test/test-reports/functorch.test_ops_4.9_907fae480dda49a3_.log (deflated 49%) 2025-12-04T09:44:59.2604345Z adding: test/test-reports/functorch.test_ops_7.9_8b11ba6dc2d21632_.log (deflated 49%) 2025-12-04T09:44:59.2604858Z adding: test/test-reports/inductor.test_collective_autotuning_1.1_aae30afac9f5c6aa_.log (deflated 51%) 2025-12-04T09:44:59.2605398Z adding: test/test-reports/inductor.test_autoheuristic_1.1_883c8c6c354bdbf8_.log (deflated 50%) 2025-12-04T09:44:59.2605907Z adding: test/test-reports/inductor.test_cpu_cpp_wrapper_1.1_954f49cf43186d21_.log (deflated 6%) 2025-12-04T09:44:59.2606407Z adding: test/test-reports/inductor.test_control_flow_5.5_61e6431b9f594de0_.log (deflated 50%) 2025-12-04T09:44:59.2606890Z adding: test/test-reports/inductor.test_mps_basic_1.1_de0965e0624bbdab_.log (deflated 6%) 2025-12-04T09:44:59.2607365Z adding: test/test-reports/test_ops_fwd_gradients_3.3_d155f918101e7d63_.log (deflated 49%) 2025-12-04T09:44:59.2607839Z adding: test/test-reports/functorch.test_control_flow_2.4_b939fd6985e209a4_.log (deflated 50%) 2025-12-04T09:44:59.2608336Z adding: test/test-reports/functorch.test_control_flow_3.4_dfdac7e5b1bc5b13_.log (deflated 50%) 2025-12-04T09:44:59.2608795Z adding: test/test-reports/test_modules_1.2_a952f73d153232c2_.log (deflated 48%) 2025-12-04T09:44:59.2609320Z adding: test/test-reports/optim.test_lrscheduler_1.1_2fa908785b6f7913_.log (deflated 7%) 2025-12-04T09:44:59.2609769Z adding: test/test-reports/test_sparse_csr_3.3_16435fa6adc55229_.log (deflated 48%) 2025-12-04T09:44:59.2610199Z adding: test/test-reports/test_meta_1.4_8e318a314d876422_.log (deflated 49%) 2025-12-04T09:44:59.2610601Z adding: test/test-reports/test_meta_2.4_3fceec0bce921fc9_.log (deflated 49%) 2025-12-04T09:44:59.2611025Z adding: test/test-reports/functorch.test_ops_4.9_0564c349a6821973_.log (deflated 49%) 2025-12-04T09:44:59.2611515Z adding: test/test-reports/functorch.test_ops_7.9_0314f84c3490751a_.log (deflated 49%) 2025-12-04T09:44:59.2649412Z ##[group]Run # Remove any previous debugging artifacts if they exist 2025-12-04T09:44:59.2649769Z # Remove any previous debugging artifacts if they exist 2025-12-04T09:44:59.2650050Z rm -f debug-*.zip 2025-12-04T09:44:59.2650253Z if [ -d 'test/debug' ]; then 2025-12-04T09:44:59.2650517Z  zip -r "debug-${FILE_SUFFIX}.zip" test/debug 2025-12-04T09:44:59.2650750Z fi 2025-12-04T09:44:59.2657521Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:44:59.2657871Z env: 2025-12-04T09:44:59.2658027Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:44:59.2658221Z HAS_NVIDIA_GPU: true 2025-12-04T09:44:59.2658450Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:44:59.2658846Z DOCKER_CONTAINER_ID: c2ae7847e2aafd5d52f3ce4ca7b43f721da17c1dd22c8b80be2b080cb30eafd7 2025-12-04T09:44:59.2659211Z DEVICE_NAME: 2025-12-04T09:44:59.2659371Z DEVICE_TYPE: 2025-12-04T09:44:59.2659678Z FILE_SUFFIX: test-default-1-5-lf.linux.g6.4xlarge.experimental.nvidia.gpu_57116084853 2025-12-04T09:44:59.2660029Z ##[endgroup] 2025-12-04T09:44:59.2747731Z ##[group]Run seemethere/upload-artifact-s3@v5 2025-12-04T09:44:59.2747959Z with: 2025-12-04T09:44:59.2748119Z s3-bucket: gha-artifacts 2025-12-04T09:44:59.2748356Z s3-prefix: pytorch/pytorch/19922768520/1/artifact 2025-12-04T09:44:59.2748595Z retention-days: 14 2025-12-04T09:44:59.2748768Z if-no-files-found: warn 2025-12-04T09:44:59.2748970Z path: test-jsons-*.zip 2025-12-04T09:44:59.2749153Z name: artifact 2025-12-04T09:44:59.2749309Z region: us-east-1 2025-12-04T09:44:59.2749464Z env: 2025-12-04T09:44:59.2749612Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:44:59.2749787Z HAS_NVIDIA_GPU: true 2025-12-04T09:44:59.2750007Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:44:59.2750411Z DOCKER_CONTAINER_ID: c2ae7847e2aafd5d52f3ce4ca7b43f721da17c1dd22c8b80be2b080cb30eafd7 2025-12-04T09:44:59.2750766Z DEVICE_NAME: 2025-12-04T09:44:59.2750934Z DEVICE_TYPE: 2025-12-04T09:44:59.2751088Z ##[endgroup] 2025-12-04T09:44:59.8668644Z NOTE: s3-prefix specified, ignoring name parameter 2025-12-04T09:44:59.8669094Z With the provided path, there will be 1 file uploaded 2025-12-04T09:44:59.8669522Z Uploading to s3 prefix: pytorch/pytorch/19922768520/1/artifact 2025-12-04T09:44:59.8722459Z Starting upload of test-jsons-test-default-1-5-lf.linux.g6.4xlarge.experimental.nvidia.gpu_57116084853.zip 2025-12-04T09:45:00.0639884Z Finished upload of test-jsons-test-default-1-5-lf.linux.g6.4xlarge.experimental.nvidia.gpu_57116084853.zip 2025-12-04T09:45:00.0880810Z ##[group]Run seemethere/upload-artifact-s3@v5 2025-12-04T09:45:00.0881042Z with: 2025-12-04T09:45:00.0881203Z s3-bucket: gha-artifacts 2025-12-04T09:45:00.0881432Z s3-prefix: pytorch/pytorch/19922768520/1/artifact 2025-12-04T09:45:00.0881676Z retention-days: 14 2025-12-04T09:45:00.0881859Z if-no-files-found: error 2025-12-04T09:45:00.0882049Z path: test-reports-*.zip 2025-12-04T09:45:00.0882230Z name: artifact 2025-12-04T09:45:00.0882389Z region: us-east-1 2025-12-04T09:45:00.0882554Z env: 2025-12-04T09:45:00.0882709Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:45:00.0882890Z HAS_NVIDIA_GPU: true 2025-12-04T09:45:00.0883120Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:45:00.0883703Z DOCKER_CONTAINER_ID: c2ae7847e2aafd5d52f3ce4ca7b43f721da17c1dd22c8b80be2b080cb30eafd7 2025-12-04T09:45:00.0884070Z DEVICE_NAME: 2025-12-04T09:45:00.0884228Z DEVICE_TYPE: 2025-12-04T09:45:00.0884388Z ##[endgroup] 2025-12-04T09:45:00.3807485Z NOTE: s3-prefix specified, ignoring name parameter 2025-12-04T09:45:00.3808120Z With the provided path, there will be 1 file uploaded 2025-12-04T09:45:00.3808720Z Uploading to s3 prefix: pytorch/pytorch/19922768520/1/artifact 2025-12-04T09:45:00.3861161Z Starting upload of test-reports-test-default-1-5-lf.linux.g6.4xlarge.experimental.nvidia.gpu_57116084853.zip 2025-12-04T09:45:00.5423029Z Finished upload of test-reports-test-default-1-5-lf.linux.g6.4xlarge.experimental.nvidia.gpu_57116084853.zip 2025-12-04T09:45:00.5671652Z ##[group]Run seemethere/upload-artifact-s3@v5 2025-12-04T09:45:00.5671904Z with: 2025-12-04T09:45:00.5672073Z s3-bucket: gha-artifacts 2025-12-04T09:45:00.5672309Z s3-prefix: pytorch/pytorch/19922768520/1/artifact 2025-12-04T09:45:00.5672574Z retention-days: 14 2025-12-04T09:45:00.5672759Z if-no-files-found: ignore 2025-12-04T09:45:00.5672945Z path: logs-*.zip 2025-12-04T09:45:00.5673114Z name: artifact 2025-12-04T09:45:00.5673396Z region: us-east-1 2025-12-04T09:45:00.5673552Z env: 2025-12-04T09:45:00.5673704Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:45:00.5673906Z HAS_NVIDIA_GPU: true 2025-12-04T09:45:00.5674132Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:45:00.5674538Z DOCKER_CONTAINER_ID: c2ae7847e2aafd5d52f3ce4ca7b43f721da17c1dd22c8b80be2b080cb30eafd7 2025-12-04T09:45:00.5674907Z DEVICE_NAME: 2025-12-04T09:45:00.5675065Z DEVICE_TYPE: 2025-12-04T09:45:00.5675212Z ##[endgroup] 2025-12-04T09:45:00.8602217Z NOTE: s3-prefix specified, ignoring name parameter 2025-12-04T09:45:00.8602742Z With the provided path, there will be 1 file uploaded 2025-12-04T09:45:00.8603151Z Uploading to s3 prefix: pytorch/pytorch/19922768520/1/artifact 2025-12-04T09:45:00.8655527Z Starting upload of logs-test-default-1-5-lf.linux.g6.4xlarge.experimental.nvidia.gpu_57116084853.zip 2025-12-04T09:45:00.9828669Z Finished upload of logs-test-default-1-5-lf.linux.g6.4xlarge.experimental.nvidia.gpu_57116084853.zip 2025-12-04T09:45:01.0075783Z ##[group]Run seemethere/upload-artifact-s3@v5 2025-12-04T09:45:01.0076018Z with: 2025-12-04T09:45:01.0076175Z s3-bucket: gha-artifacts 2025-12-04T09:45:01.0076405Z s3-prefix: pytorch/pytorch/19922768520/1/artifact 2025-12-04T09:45:01.0076656Z retention-days: 14 2025-12-04T09:45:01.0076831Z if-no-files-found: ignore 2025-12-04T09:45:01.0077032Z path: debug-*.zip 2025-12-04T09:45:01.0077211Z name: artifact 2025-12-04T09:45:01.0077365Z region: us-east-1 2025-12-04T09:45:01.0077523Z env: 2025-12-04T09:45:01.0077671Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:45:01.0077852Z HAS_NVIDIA_GPU: true 2025-12-04T09:45:01.0078077Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:45:01.0078500Z DOCKER_CONTAINER_ID: c2ae7847e2aafd5d52f3ce4ca7b43f721da17c1dd22c8b80be2b080cb30eafd7 2025-12-04T09:45:01.0078859Z DEVICE_NAME: 2025-12-04T09:45:01.0079015Z DEVICE_TYPE: 2025-12-04T09:45:01.0079173Z ##[endgroup] 2025-12-04T09:45:01.2971980Z No files were found with the provided path: debug-*.zip. No artifacts will be uploaded. 2025-12-04T09:45:01.3242624Z ##[group]Run # shellcheck disable=SC2156 2025-12-04T09:45:01.3242954Z # shellcheck disable=SC2156 2025-12-04T09:45:01.3243372Z find . -iname "core.[1-9]*" -exec docker exec "${DOCKER_CONTAINER_ID}" sh -c "gdb python {} -ex 'bt' -ex 'q'" \; 2025-12-04T09:45:01.3252208Z shell: /usr/bin/bash -e {0} 2025-12-04T09:45:01.3252515Z env: 2025-12-04T09:45:01.3252672Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:45:01.3252865Z HAS_NVIDIA_GPU: true 2025-12-04T09:45:01.3253089Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:45:01.3253497Z DOCKER_CONTAINER_ID: c2ae7847e2aafd5d52f3ce4ca7b43f721da17c1dd22c8b80be2b080cb30eafd7 2025-12-04T09:45:01.3253857Z DEVICE_NAME: 2025-12-04T09:45:01.3254011Z DEVICE_TYPE: 2025-12-04T09:45:01.3254167Z ##[endgroup] 2025-12-04T09:45:01.6647194Z Prepare all required actions 2025-12-04T09:45:01.6647554Z Getting action download info 2025-12-04T09:45:01.8445982Z Download action repository 'actions/setup-python@v6' (SHA:83679a892e2d95755f2dac6acb0bfd1e9ac5d548) 2025-12-04T09:45:03.2990865Z ##[group]Run ./.github/actions/upload-utilization-stats 2025-12-04T09:45:03.2991125Z with: 2025-12-04T09:45:03.2991289Z job_id: 57116084853 2025-12-04T09:45:03.2991730Z job_name: linux-jammy-cuda12.8-py3.10-gcc11 / test (default, 1, 5, lf.linux.g6.4xlarge.experimental.nvidia.gpu, rerun_disabled_tests) 2025-12-04T09:45:03.2992311Z workflow_name: trunk 2025-12-04T09:45:03.2992489Z workflow_run_id: 19922768520 2025-12-04T09:45:03.2992680Z workflow_attempt: 1 2025-12-04T09:45:03.2992842Z env: 2025-12-04T09:45:03.2992996Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:45:03.2993185Z HAS_NVIDIA_GPU: true 2025-12-04T09:45:03.2993403Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:45:03.2993829Z DOCKER_CONTAINER_ID: c2ae7847e2aafd5d52f3ce4ca7b43f721da17c1dd22c8b80be2b080cb30eafd7 2025-12-04T09:45:03.2994178Z DEVICE_NAME: 2025-12-04T09:45:03.2994417Z DEVICE_TYPE: 2025-12-04T09:45:03.2994571Z ##[endgroup] 2025-12-04T09:45:03.3098971Z ##[group]Run actions/setup-python@v6 2025-12-04T09:45:03.3099189Z with: 2025-12-04T09:45:03.3099352Z python-version: 3.10 2025-12-04T09:45:03.3099541Z check-latest: false 2025-12-04T09:45:03.3099790Z token: *** 2025-12-04T09:45:03.3099956Z update-environment: true 2025-12-04T09:45:03.3100155Z allow-prereleases: false 2025-12-04T09:45:03.3100343Z freethreaded: false 2025-12-04T09:45:03.3100507Z env: 2025-12-04T09:45:03.3100648Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:45:03.3100838Z HAS_NVIDIA_GPU: true 2025-12-04T09:45:03.3101061Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:45:03.3101465Z DOCKER_CONTAINER_ID: c2ae7847e2aafd5d52f3ce4ca7b43f721da17c1dd22c8b80be2b080cb30eafd7 2025-12-04T09:45:03.3101826Z DEVICE_NAME: 2025-12-04T09:45:03.3101993Z DEVICE_TYPE: 2025-12-04T09:45:03.3102140Z ##[endgroup] 2025-12-04T09:45:03.6213792Z ##[group]Installed versions 2025-12-04T09:45:03.6221975Z Version 3.10 was not found in the local cache 2025-12-04T09:45:03.6365123Z (node:99910) [DEP0040] DeprecationWarning: The `punycode` module is deprecated. Please use a userland alternative instead. 2025-12-04T09:45:03.6365845Z (Use `node --trace-deprecation ...` to show where the warning was created) 2025-12-04T09:45:04.0564511Z ##[error]The version '3.10' with architecture 'x64' was not found for this operating system. The list of all available versions can be found here: https://raw.githubusercontent.com/actions/python-versions/main/versions-manifest.json 2025-12-04T09:45:04.0734010Z ##[group]Run pytorch/test-infra/.github/actions/teardown-linux@main 2025-12-04T09:45:04.0734365Z with: 2025-12-04T09:45:04.0734518Z env: 2025-12-04T09:45:04.0734677Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:45:04.0734861Z HAS_NVIDIA_GPU: true 2025-12-04T09:45:04.0735094Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:45:04.0735502Z DOCKER_CONTAINER_ID: c2ae7847e2aafd5d52f3ce4ca7b43f721da17c1dd22c8b80be2b080cb30eafd7 2025-12-04T09:45:04.0735864Z DEVICE_NAME: 2025-12-04T09:45:04.0736019Z DEVICE_TYPE: 2025-12-04T09:45:04.0736175Z ##[endgroup] 2025-12-04T09:45:04.0851750Z ##[group]Run set -eou pipefail 2025-12-04T09:45:04.0851999Z set -eou pipefail 2025-12-04T09:45:04.0852264Z  2025-12-04T09:45:04.0852543Z echo "Holding runner for 2 hours until all ssh sessions have logged out" 2025-12-04T09:45:04.0852877Z for _ in $(seq 1440); do 2025-12-04T09:45:04.0853135Z  # Break if no ssh session exists anymore 2025-12-04T09:45:04.0853384Z  if [ "$(who)" = "" ]; then 2025-12-04T09:45:04.0853596Z  break 2025-12-04T09:45:04.0853750Z  fi 2025-12-04T09:45:04.0853919Z  echo "." 2025-12-04T09:45:04.0854090Z  sleep 5 2025-12-04T09:45:04.0854248Z done 2025-12-04T09:45:04.0862719Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:45:04.0863011Z env: 2025-12-04T09:45:04.0863166Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:45:04.0863359Z HAS_NVIDIA_GPU: true 2025-12-04T09:45:04.0863583Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:45:04.0863986Z DOCKER_CONTAINER_ID: c2ae7847e2aafd5d52f3ce4ca7b43f721da17c1dd22c8b80be2b080cb30eafd7 2025-12-04T09:45:04.0864339Z DEVICE_NAME: 2025-12-04T09:45:04.0864602Z DEVICE_TYPE: 2025-12-04T09:45:04.0864760Z ##[endgroup] 2025-12-04T09:45:04.0891592Z Holding runner for 2 hours until all ssh sessions have logged out 2025-12-04T09:45:04.1371906Z ##[group]Run # ignore expansion of "docker ps -q" since it could be empty 2025-12-04T09:45:04.1372421Z # ignore expansion of "docker ps -q" since it could be empty 2025-12-04T09:45:04.1372731Z # shellcheck disable=SC2046 2025-12-04T09:45:04.1372971Z docker stop $(docker ps -q) || true 2025-12-04T09:45:04.1373222Z # Prune all of the docker images 2025-12-04T09:45:04.1373446Z docker system prune -af 2025-12-04T09:45:04.1380578Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:45:04.1380852Z env: 2025-12-04T09:45:04.1381010Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:45:04.1381194Z HAS_NVIDIA_GPU: true 2025-12-04T09:45:04.1381416Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:45:04.1381839Z DOCKER_CONTAINER_ID: c2ae7847e2aafd5d52f3ce4ca7b43f721da17c1dd22c8b80be2b080cb30eafd7 2025-12-04T09:45:04.1382212Z DEVICE_NAME: 2025-12-04T09:45:04.1382368Z DEVICE_TYPE: 2025-12-04T09:45:04.1382523Z ##[endgroup] 2025-12-04T09:45:26.4641969Z c2ae7847e2aa 2025-12-04T09:45:27.2511653Z Deleted Containers: 2025-12-04T09:45:27.2512088Z c2ae7847e2aafd5d52f3ce4ca7b43f721da17c1dd22c8b80be2b080cb30eafd7 2025-12-04T09:45:27.2512407Z 2025-12-04T09:45:38.9311704Z Deleted Images: 2025-12-04T09:45:38.9313991Z untagged: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T09:45:38.9316064Z untagged: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image@sha256:ba21003510dba4bdeed83df81a56fa468e0ee1b612a9445ae1f402a280804f97 2025-12-04T09:45:38.9317399Z deleted: sha256:add7313791033822205cdb3cf32096534b2cfaa4855bd48119b59000bfe00301 2025-12-04T09:45:38.9317935Z deleted: sha256:85a76b7bf29ad34eb76cce6f46af5d49a58b6272f80f983d5c769e82c7749301 2025-12-04T09:45:38.9318429Z deleted: sha256:0882f3ce59ff5ae30195ee4b059fc713e13eda107a3a7814a4616ac9058a30a4 2025-12-04T09:45:38.9319293Z deleted: sha256:64ba5b9344c11a3e4729136076830b90ac4cf1554046edb1bd4f0784b66ebd9b 2025-12-04T09:45:38.9319750Z deleted: sha256:88213c59cf461a65ab9b6cb07b4195dc9d41b5241c152daa002c7b3112e09124 2025-12-04T09:45:38.9320199Z deleted: sha256:4c0f83afa802ffbc05ebaf1aa50e48a2447c7c295549a6dded80ac63437906ca 2025-12-04T09:45:38.9320646Z deleted: sha256:6f7ec74460e8fb070c8209949095ea3be5f4e2fd69c9f750cd39ac4093f5e64b 2025-12-04T09:45:38.9321087Z deleted: sha256:d6928b0d1021b31942fdcb64e5eb4a34682de66e959dd424ed6ed02c29cd706d 2025-12-04T09:45:38.9321867Z deleted: sha256:4e9fbcb1705a6351bb34dd320558752614308636b94fd9ae6f26063e3deadc0a 2025-12-04T09:45:38.9322603Z deleted: sha256:43aabd0201f48712f21758071352dea029b4de37be08b2e2197706856a9ecbf2 2025-12-04T09:45:38.9323177Z deleted: sha256:940a98dec78303f0548beb1033242a45e9097607ef3e55c8b949b69b73d1b95e 2025-12-04T09:45:38.9323623Z deleted: sha256:d2849fa0e0411cf66e4408831d70e38838afb55b11a80c1c4d8aa0ae7dc9ca40 2025-12-04T09:45:38.9324193Z deleted: sha256:14f40d23c20c7e562623f89deb376520296758bc39dd3c77284049b84ebd8a31 2025-12-04T09:45:38.9324942Z deleted: sha256:a8ccba61f90ca097cb391d0f4fbed0d9f821d06b00e28f7332e9e2dcfcbac4ca 2025-12-04T09:45:38.9325398Z deleted: sha256:91b2060d290547d3b517d4a11d994bbe23f4560b5546cb91918ca1828dde6be1 2025-12-04T09:45:38.9325821Z deleted: sha256:b42a184755715dcfead7fad655a127433541d316d9628f5f730ff17ad5f8071c 2025-12-04T09:45:38.9326266Z deleted: sha256:aa5b4f3c9169061dc3c6da0e677e8a86f11ecb0a3f9fb4861ab3d8c04379775c 2025-12-04T09:45:38.9326716Z deleted: sha256:b4dcf450081a48d77fea0a21b8d810a69c03608a595e754fe7d365058d0579b7 2025-12-04T09:45:38.9327168Z deleted: sha256:4f7fe12d3d4f5bf890c7ada4ce16f17a105472aa6509a778f917dcce2f28174b 2025-12-04T09:45:38.9327608Z deleted: sha256:2d1d5a74182594f9a8553df00fdcfc809dba407bcd6700d667f862cbe9d555ce 2025-12-04T09:45:38.9328060Z deleted: sha256:d901e2f5d449aeed16b727bdcc11fc0e0f6c30c8fc5c39ac7eeac8a74d9d176c 2025-12-04T09:45:38.9328683Z deleted: sha256:a04df2603bd12372c6632469a9a81ebc4a8d677452c250672b9692884fa6a452 2025-12-04T09:45:38.9329112Z deleted: sha256:f438a6b52273a552dc3820a55c74c53a62a0eae9f2a7d21b37125add7d71639f 2025-12-04T09:45:38.9329551Z deleted: sha256:d4b09517e9518d709ac98b0ae6f8446ec9ac51688253607b1fca67aa2c87b3f4 2025-12-04T09:45:38.9329989Z deleted: sha256:c1fa38335237f5e7263e39d3d3de98215bcfbbb12b826955c02e149bf68efd13 2025-12-04T09:45:38.9330423Z deleted: sha256:c898d20a30de901fca74d7611663b17ab48e1726a11e031e40548ed16ee81877 2025-12-04T09:45:38.9330970Z deleted: sha256:3baceec7096518fcc10696feba551639d698b3145c2fc09cac927bb60c0fd751 2025-12-04T09:45:38.9331415Z deleted: sha256:5245aaaa3d5c3a19f76b9a6c920bd82d1a0ff5289f87c8c109652089709d9b3b 2025-12-04T09:45:38.9331855Z deleted: sha256:f05cc789b95246938c377f474c41187965b89ceac0250e7d5124bec32153f447 2025-12-04T09:45:38.9332406Z deleted: sha256:07ec4fc008de4e7a2c794ec7094cc72e0d287c04c8b2156163aee0bae147fe2d 2025-12-04T09:45:38.9332864Z deleted: sha256:c6302601ad5fde573c1f8c900250478fca7fdc6907d8fd4fae651b94b4d9264d 2025-12-04T09:45:38.9333308Z deleted: sha256:cc5e955ee1dc54931f02606c5ea87aae14f03b5d764492be611480ab041f2882 2025-12-04T09:45:38.9333772Z deleted: sha256:f21c03518996d98452338f4e80bcfd9b139a1dab155f4830be0d3f623035269f 2025-12-04T09:45:38.9334298Z deleted: sha256:519ca6f1279f7886f25f0005527cfa627deebbc5b7d7cdbfa7ef962bcfc4c26d 2025-12-04T09:45:38.9334728Z deleted: sha256:0ef990495216807d0175b192045be3f617e72331bc373b3434807f41bf69168d 2025-12-04T09:45:38.9335152Z deleted: sha256:7093edf7319e1f0e01654c3224e32c8dede5b948d106e0b9b03cbf0bb1091e33 2025-12-04T09:45:38.9335584Z deleted: sha256:c478161e058e2f4041555c3e880b95ee1ee047938dc58549a3a88135740996ae 2025-12-04T09:45:38.9336006Z deleted: sha256:9bb853b0d938cd7c36a80ce8ee40653f2c0ff92719209b11beb03acc8855ce3e 2025-12-04T09:45:38.9336445Z deleted: sha256:fdf2ace71a78ce6910ef9c4b073c195531da47022443b606bb92dcd6499b6afc 2025-12-04T09:45:38.9336886Z deleted: sha256:576c2b3770d871937d3cfb7014328bcb4bd1aed0c28bc438764b3bfdac4c1ac2 2025-12-04T09:45:38.9337326Z deleted: sha256:878e92b9cb82de09ac14a9d5f3f7bc2411a799b6f54d0d64b78c2bb4d1fdc0fc 2025-12-04T09:45:38.9337901Z deleted: sha256:85c8c3b98b65a6695f988a10cc66c981d73a3ef03eda15b8e14d227b50b56300 2025-12-04T09:45:38.9338354Z deleted: sha256:ce2ab3ba07794f9ee95d6ea7de6dcd3d2aed96561f9a79192dd56ca5bf29313a 2025-12-04T09:45:38.9338786Z deleted: sha256:37a6e12976ca957286977e696e63012ab9821214b0483fe1a48d29dcb280508a 2025-12-04T09:45:38.9339211Z deleted: sha256:cd1d5d3dd7038144ca6fe961c0d4c8e705625ae0c36190ba8b3e9602abedad19 2025-12-04T09:45:38.9339656Z deleted: sha256:0e707276e0be2e0008b86d594fadc0d16444d66c4fb7227c56f144cbb3c2affd 2025-12-04T09:45:38.9340094Z deleted: sha256:22d4aad6a2ada91b341c1225a0f314042b8aeabef7568c5c019709b058bf070b 2025-12-04T09:45:38.9340534Z deleted: sha256:ee4adacf4e0933131d0275eddad406b3c8147e6cf07a292b99f1aff4b5355f33 2025-12-04T09:45:38.9340988Z deleted: sha256:43da0b9e7c0e18403dcb834e53628dc7c970ccb2dbd091878c0d7c0170dbc97f 2025-12-04T09:45:38.9341429Z deleted: sha256:00571684bdcd75beda15eb7d4e79b5458bc914350f9bb4d87fcdc97ad15e0da1 2025-12-04T09:45:38.9341871Z deleted: sha256:41615f09950259f1d75e82ef35b6fc53b18fe71ebff143744cfd51009d04349e 2025-12-04T09:45:38.9342301Z deleted: sha256:75ab34d2eed3c7915467a506ab6dab2711918fbabe94add2fb5c62780221ab0c 2025-12-04T09:45:38.9342749Z deleted: sha256:0a39ef2bebf44c1c3893d1e5fb42dad48b8fac7ca673141267ee967f85455e89 2025-12-04T09:45:38.9343192Z deleted: sha256:9b7d024e48ba1f9824a54597621b1b062cbc4aa41a77d81ca538d6b5c24a612c 2025-12-04T09:45:38.9343616Z deleted: sha256:392257172de6434c271bd93394218a91e9aa86d7c18abc2f2759317b9d5fb6de 2025-12-04T09:45:38.9344033Z deleted: sha256:6c3232860b930866a463a356124fc392c7e5f04895695229257e8c3e8a02711d 2025-12-04T09:45:38.9344458Z deleted: sha256:63dd55b807215e2fa6c715419ac0c5072d02dddc848dbf74bb7e77b906b5eaed 2025-12-04T09:45:38.9344891Z deleted: sha256:07a8738c1b4584db72ed9aa60f5274321eb0ba16263450da3a75df8326ebc25f 2025-12-04T09:45:38.9345364Z deleted: sha256:053fe2965b01281d12040ec1893e0d1aa77362a49ea9a1067402272c69dad9f5 2025-12-04T09:45:38.9345803Z deleted: sha256:7857fb5eb181c4e80262ecab60bdd3c266cf3d1409ceb76c05882609b416a8d3 2025-12-04T09:45:38.9346260Z deleted: sha256:752528477fc99089de3bd2c6da7b30cf34f2e901fe06d8fcfe685b411461e883 2025-12-04T09:45:38.9346701Z deleted: sha256:cce0210e2f4b042601813df03aa294a86b0c668fcfc75f4c63f6fa12b2952e15 2025-12-04T09:45:38.9347154Z deleted: sha256:f2bb405a26705ecd12d21380d26d9355d01db3a2175080fbdb468f2b5a25a76c 2025-12-04T09:45:38.9347601Z deleted: sha256:ad430120d4ffbaf97cd8d6de6ea8eefa4a8f80ec45f0b176c6b26bff0970fd33 2025-12-04T09:45:38.9348101Z deleted: sha256:225a4910baea7cc540ed43eeac75046293800ab0b8e0192b51e991c8cb50bcf3 2025-12-04T09:45:38.9348541Z deleted: sha256:a259945b0c3507f049fbac10fb3d3ffe43d45e83c91b80ae8cd1dafb855ad83c 2025-12-04T09:45:38.9348979Z deleted: sha256:862a98881b1d5adad5c21d01602773b894794097de80964ef8f47bcaadb43255 2025-12-04T09:45:38.9349411Z deleted: sha256:1cf6d3c8b6c2694b79a2d08719594903811c330a36a4c7a8a7153a350b53d292 2025-12-04T09:45:38.9349851Z deleted: sha256:232a1ae8b0fee817ff7838bb5986a2f38377d3b1dbbf5217b576af0f953b0844 2025-12-04T09:45:38.9350294Z deleted: sha256:c72c5705dabd6314423dd7d4fb260a20d5d9886b2ebce60d19e9d78c4a2335c2 2025-12-04T09:45:38.9350727Z deleted: sha256:296734cf81fd92c913884d058908598424ffe072676e38de289bbab83768c7bd 2025-12-04T09:45:38.9351150Z deleted: sha256:7c76040481b889847a1804021aeff07547eaa4ee706d6137db218d497a8fd9c1 2025-12-04T09:45:38.9351582Z deleted: sha256:d5e293f5b354e8cbcc6de893ea72cc632b02d8fdfbb08ec3127c4e9662f3ebff 2025-12-04T09:45:38.9352028Z deleted: sha256:f35a64e429c88e249645090f21fbe7dae108d98e0ab4ea13184f24b3fd66c315 2025-12-04T09:45:38.9352465Z deleted: sha256:ce6ae8d595c8e69115c51b1ce4f9a9158795d7b863b1cb53f21c39a87974d41b 2025-12-04T09:45:38.9352909Z deleted: sha256:8941abaee59400fb9b3a60765fea4a1fc2a6a447467a6d983e84c7f72494a450 2025-12-04T09:45:38.9353353Z deleted: sha256:ef53c29a9a2c2bc80ffdb9bfaf92842436b5755ec1ce828b9d11e5e27d656ea1 2025-12-04T09:45:38.9353801Z deleted: sha256:7a347fb0acb43f1c814f8c8ff21185e8b5cf64d7bc5988cea060f77d906e08b5 2025-12-04T09:45:38.9354329Z deleted: sha256:cc855dc9be79496e15175569dced2d13477e50b077a5fd3945f9bf50018880c1 2025-12-04T09:45:38.9354767Z deleted: sha256:f7a9946ada3d4786658bc0b643808bb32a9a45e4e90e30dc43ee19e2dbe24024 2025-12-04T09:45:38.9355194Z deleted: sha256:c22a9215f62812c1d2e32827f5221ff556c5b6702aadbdab6b87b8293f19635e 2025-12-04T09:45:38.9355669Z deleted: sha256:959a56746620012e37c1def1a83c5afb1e7c0adc59b021a28beb53c24df98032 2025-12-04T09:45:38.9356120Z deleted: sha256:31a0fff0695bf6100c17954be72eab2095b466d559c75c3faf2a17d8c41e6ebe 2025-12-04T09:45:38.9356552Z deleted: sha256:c15e2b5241b9e55af1b2593e544391b4b44d0505e6528e8f12425136e93b424c 2025-12-04T09:45:38.9356979Z deleted: sha256:73974f74b436f39a2fdb6461b1e3f7c3e41c73325776fa71d16b942a5b4a365b 2025-12-04T09:45:38.9357344Z untagged: public.ecr.aws/docker/library/python:3.13 2025-12-04T09:45:38.9357843Z untagged: public.ecr.aws/docker/library/python@sha256:3f986299a7b8b44b0d8cf9bda2b22361ce5c3058ef5d7cb17fb7452506680ab0 2025-12-04T09:45:38.9358433Z deleted: sha256:44438aecfedf7b6086fce506dae0db5ba7fc0027f9b743f1a75a6b5cbc7de70a 2025-12-04T09:45:38.9358879Z deleted: sha256:6f09a1f5d8a107c2532fbd116e75116cb75fa77b1a7d72d3bdf1ac12de152acd 2025-12-04T09:45:38.9359323Z deleted: sha256:fe5f3ac0be086125eb1e3cd10cc33e8e426f4e079381f7ce5a987b626e99fa67 2025-12-04T09:45:38.9359771Z deleted: sha256:79dd2061a22cf919cfc4f1f02704bfda09afadb017265e670ee54441d296c06c 2025-12-04T09:45:38.9360214Z deleted: sha256:9447ad402aafdbee17e999b0ec84ad89c2646dbebf054d469d4f8bee77f66212 2025-12-04T09:45:38.9360655Z deleted: sha256:7a4909f3c1975be52292f53107495ee1b41c17494918767ccedf1cf1688ae318 2025-12-04T09:45:38.9361081Z deleted: sha256:3474923d97f1f498237650a7d51bd4aea37d5e6b9d8a778777920584af5dd560 2025-12-04T09:45:38.9361511Z deleted: sha256:683afd1773444401a9cbd24842ee5d9154a11abb4fab63ddea5c03df788597ee 2025-12-04T09:45:38.9362100Z 2025-12-04T09:45:38.9362222Z Total reclaimed space: 35.47GB 2025-12-04T09:45:38.9451639Z Post job cleanup. 2025-12-04T09:45:38.9481522Z Post job cleanup. 2025-12-04T09:45:39.0883426Z (node:100065) [DEP0040] DeprecationWarning: The `punycode` module is deprecated. Please use a userland alternative instead. 2025-12-04T09:45:39.0884182Z (Use `node --trace-deprecation ...` to show where the warning was created) 2025-12-04T09:45:39.1048250Z Post job cleanup. 2025-12-04T09:45:39.1084917Z Post job cleanup. 2025-12-04T09:45:39.2043128Z [command]/usr/bin/git version 2025-12-04T09:45:39.2080094Z git version 2.50.1 2025-12-04T09:45:39.2113197Z Copying '/home/ec2-user/.gitconfig' to '/home/ec2-user/actions-runner/_work/_temp/95eb2968-cc00-4b6f-9806-9851a2277d87/.gitconfig' 2025-12-04T09:45:39.2122716Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/95eb2968-cc00-4b6f-9806-9851a2277d87' before making global git config changes 2025-12-04T09:45:39.2123619Z Adding repository directory to the temporary git global config as a safe directory 2025-12-04T09:45:39.2128091Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/pytorch/pytorch 2025-12-04T09:45:39.2169577Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand 2025-12-04T09:45:39.2201267Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :" 2025-12-04T09:45:39.2550331Z Entering 'android/libs/fbjni' 2025-12-04T09:45:39.2615876Z Entering 'third_party/FP16' 2025-12-04T09:45:39.2684463Z Entering 'third_party/FXdiv' 2025-12-04T09:45:39.2748229Z Entering 'third_party/NNPACK' 2025-12-04T09:45:39.2817405Z Entering 'third_party/NVTX' 2025-12-04T09:45:39.2882390Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T09:45:39.2946273Z Entering 'third_party/XNNPACK' 2025-12-04T09:45:39.3031550Z Entering 'third_party/aiter' 2025-12-04T09:45:39.3099968Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T09:45:39.3177788Z Entering 'third_party/benchmark' 2025-12-04T09:45:39.3243216Z Entering 'third_party/composable_kernel' 2025-12-04T09:45:39.3315039Z Entering 'third_party/cpp-httplib' 2025-12-04T09:45:39.3379477Z Entering 'third_party/cpuinfo' 2025-12-04T09:45:39.3444529Z Entering 'third_party/cudnn_frontend' 2025-12-04T09:45:39.3510230Z Entering 'third_party/cutlass' 2025-12-04T09:45:39.3587420Z Entering 'third_party/fbgemm' 2025-12-04T09:45:39.3659005Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T09:45:39.3721481Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T09:45:39.3792576Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T09:45:39.3855967Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T09:45:39.3926152Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T09:45:39.3989603Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T09:45:39.4052856Z Entering 'third_party/fbgemm/external/json' 2025-12-04T09:45:39.4119899Z Entering 'third_party/flash-attention' 2025-12-04T09:45:39.4183906Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T09:45:39.4252950Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T09:45:39.4325442Z Entering 'third_party/flatbuffers' 2025-12-04T09:45:39.4397984Z Entering 'third_party/fmt' 2025-12-04T09:45:39.4468926Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T09:45:39.4533766Z Entering 'third_party/gloo' 2025-12-04T09:45:39.4596741Z Entering 'third_party/googletest' 2025-12-04T09:45:39.4658029Z Entering 'third_party/ideep' 2025-12-04T09:45:39.4722355Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T09:45:39.4793039Z Entering 'third_party/ittapi' 2025-12-04T09:45:39.4861738Z Entering 'third_party/kineto' 2025-12-04T09:45:39.4927492Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T09:45:39.4995536Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T09:45:39.5060229Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T09:45:39.5127966Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T09:45:39.5197153Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T09:45:39.5266473Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T09:45:39.5332478Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T09:45:39.5396029Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T09:45:39.5459480Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T09:45:39.5528447Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T09:45:39.5597240Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T09:45:39.5667447Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T09:45:39.5732081Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T09:45:39.5805744Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T09:45:39.5869670Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T09:45:39.5939014Z Entering 'third_party/kleidiai' 2025-12-04T09:45:39.6004996Z Entering 'third_party/mimalloc' 2025-12-04T09:45:39.6069228Z Entering 'third_party/nlohmann' 2025-12-04T09:45:39.6139920Z Entering 'third_party/onnx' 2025-12-04T09:45:39.6221705Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T09:45:39.6287409Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T09:45:39.6356389Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T09:45:39.6418745Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T09:45:39.6484694Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T09:45:39.6551549Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T09:45:39.6621510Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T09:45:39.6684304Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T09:45:39.6754026Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T09:45:39.6819179Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T09:45:39.6889068Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T09:45:39.6954524Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T09:45:39.7040254Z Entering 'third_party/pocketfft' 2025-12-04T09:45:39.7105446Z Entering 'third_party/protobuf' 2025-12-04T09:45:39.7171318Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T09:45:39.7238432Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T09:45:39.7306402Z Entering 'third_party/psimd' 2025-12-04T09:45:39.7375635Z Entering 'third_party/pthreadpool' 2025-12-04T09:45:39.7438563Z Entering 'third_party/pybind11' 2025-12-04T09:45:39.7500142Z Entering 'third_party/python-peachpy' 2025-12-04T09:45:39.7569766Z Entering 'third_party/sleef' 2025-12-04T09:45:39.7634161Z Entering 'third_party/tensorpipe' 2025-12-04T09:45:39.7695708Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T09:45:39.7758077Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T09:45:39.7825400Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T09:45:39.7888689Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T09:45:39.7949168Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T09:45:39.8044064Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2025-12-04T09:45:39.8067060Z http.https://github.com/.extraheader 2025-12-04T09:45:39.8076195Z [command]/usr/bin/git config --local --unset-all http.https://github.com/.extraheader 2025-12-04T09:45:39.8107656Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :" 2025-12-04T09:45:39.8434606Z Entering 'android/libs/fbjni' 2025-12-04T09:45:39.8481996Z http.https://github.com/.extraheader 2025-12-04T09:45:39.8526228Z Entering 'third_party/FP16' 2025-12-04T09:45:39.8573170Z http.https://github.com/.extraheader 2025-12-04T09:45:39.8614319Z Entering 'third_party/FXdiv' 2025-12-04T09:45:39.8654715Z http.https://github.com/.extraheader 2025-12-04T09:45:39.8700337Z Entering 'third_party/NNPACK' 2025-12-04T09:45:39.8742098Z http.https://github.com/.extraheader 2025-12-04T09:45:39.8780294Z Entering 'third_party/NVTX' 2025-12-04T09:45:39.8822157Z http.https://github.com/.extraheader 2025-12-04T09:45:39.8861069Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T09:45:39.8902908Z http.https://github.com/.extraheader 2025-12-04T09:45:39.8948624Z Entering 'third_party/XNNPACK' 2025-12-04T09:45:39.8990951Z http.https://github.com/.extraheader 2025-12-04T09:45:39.9044045Z Entering 'third_party/aiter' 2025-12-04T09:45:39.9083773Z http.https://github.com/.extraheader 2025-12-04T09:45:39.9129032Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T09:45:39.9171735Z http.https://github.com/.extraheader 2025-12-04T09:45:39.9228185Z Entering 'third_party/benchmark' 2025-12-04T09:45:39.9270904Z http.https://github.com/.extraheader 2025-12-04T09:45:39.9315705Z Entering 'third_party/composable_kernel' 2025-12-04T09:45:39.9361660Z http.https://github.com/.extraheader 2025-12-04T09:45:39.9407591Z Entering 'third_party/cpp-httplib' 2025-12-04T09:45:39.9450714Z http.https://github.com/.extraheader 2025-12-04T09:45:39.9490222Z Entering 'third_party/cpuinfo' 2025-12-04T09:45:39.9531793Z http.https://github.com/.extraheader 2025-12-04T09:45:39.9571617Z Entering 'third_party/cudnn_frontend' 2025-12-04T09:45:39.9612729Z http.https://github.com/.extraheader 2025-12-04T09:45:39.9658613Z Entering 'third_party/cutlass' 2025-12-04T09:45:39.9701314Z http.https://github.com/.extraheader 2025-12-04T09:45:39.9757327Z Entering 'third_party/fbgemm' 2025-12-04T09:45:39.9802138Z http.https://github.com/.extraheader 2025-12-04T09:45:39.9847939Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T09:45:39.9891002Z http.https://github.com/.extraheader 2025-12-04T09:45:39.9929089Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T09:45:39.9974116Z http.https://github.com/.extraheader 2025-12-04T09:45:40.0022617Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T09:45:40.0064118Z http.https://github.com/.extraheader 2025-12-04T09:45:40.0109303Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T09:45:40.0151129Z http.https://github.com/.extraheader 2025-12-04T09:45:40.0198678Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T09:45:40.0240726Z http.https://github.com/.extraheader 2025-12-04T09:45:40.0279415Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T09:45:40.0321338Z http.https://github.com/.extraheader 2025-12-04T09:45:40.0359474Z Entering 'third_party/fbgemm/external/json' 2025-12-04T09:45:40.0401905Z http.https://github.com/.extraheader 2025-12-04T09:45:40.0450211Z Entering 'third_party/flash-attention' 2025-12-04T09:45:40.0495666Z http.https://github.com/.extraheader 2025-12-04T09:45:40.0539678Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T09:45:40.0581714Z http.https://github.com/.extraheader 2025-12-04T09:45:40.0630454Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T09:45:40.0672109Z http.https://github.com/.extraheader 2025-12-04T09:45:40.0727177Z Entering 'third_party/flatbuffers' 2025-12-04T09:45:40.0774023Z http.https://github.com/.extraheader 2025-12-04T09:45:40.0821814Z Entering 'third_party/fmt' 2025-12-04T09:45:40.0864465Z http.https://github.com/.extraheader 2025-12-04T09:45:40.0909326Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T09:45:40.0951803Z http.https://github.com/.extraheader 2025-12-04T09:45:40.0990592Z Entering 'third_party/gloo' 2025-12-04T09:45:40.1032164Z http.https://github.com/.extraheader 2025-12-04T09:45:40.1070781Z Entering 'third_party/googletest' 2025-12-04T09:45:40.1112488Z http.https://github.com/.extraheader 2025-12-04T09:45:40.1158541Z Entering 'third_party/ideep' 2025-12-04T09:45:40.1201152Z http.https://github.com/.extraheader 2025-12-04T09:45:40.1244050Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T09:45:40.1286303Z http.https://github.com/.extraheader 2025-12-04T09:45:40.1334865Z Entering 'third_party/ittapi' 2025-12-04T09:45:40.1377833Z http.https://github.com/.extraheader 2025-12-04T09:45:40.1417232Z Entering 'third_party/kineto' 2025-12-04T09:45:40.1463997Z http.https://github.com/.extraheader 2025-12-04T09:45:40.1508365Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T09:45:40.1548714Z http.https://github.com/.extraheader 2025-12-04T09:45:40.1586736Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T09:45:40.1627858Z http.https://github.com/.extraheader 2025-12-04T09:45:40.1670464Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T09:45:40.1712270Z http.https://github.com/.extraheader 2025-12-04T09:45:40.1751489Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T09:45:40.1793778Z http.https://github.com/.extraheader 2025-12-04T09:45:40.1835544Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T09:45:40.1877475Z http.https://github.com/.extraheader 2025-12-04T09:45:40.1916398Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T09:45:40.1957940Z http.https://github.com/.extraheader 2025-12-04T09:45:40.2000850Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T09:45:40.2042537Z http.https://github.com/.extraheader 2025-12-04T09:45:40.2084915Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T09:45:40.2128049Z http.https://github.com/.extraheader 2025-12-04T09:45:40.2168564Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T09:45:40.2214483Z http.https://github.com/.extraheader 2025-12-04T09:45:40.2260427Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T09:45:40.2301282Z http.https://github.com/.extraheader 2025-12-04T09:45:40.2349781Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T09:45:40.2392321Z http.https://github.com/.extraheader 2025-12-04T09:45:40.2433149Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T09:45:40.2474434Z http.https://github.com/.extraheader 2025-12-04T09:45:40.2518376Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T09:45:40.2564126Z http.https://github.com/.extraheader 2025-12-04T09:45:40.2614252Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T09:45:40.2654563Z http.https://github.com/.extraheader 2025-12-04T09:45:40.2694890Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T09:45:40.2734852Z http.https://github.com/.extraheader 2025-12-04T09:45:40.2779059Z Entering 'third_party/kleidiai' 2025-12-04T09:45:40.2821636Z http.https://github.com/.extraheader 2025-12-04T09:45:40.2860705Z Entering 'third_party/mimalloc' 2025-12-04T09:45:40.2904040Z http.https://github.com/.extraheader 2025-12-04T09:45:40.2949416Z Entering 'third_party/nlohmann' 2025-12-04T09:45:40.2992578Z http.https://github.com/.extraheader 2025-12-04T09:45:40.3035051Z Entering 'third_party/onnx' 2025-12-04T09:45:40.3077196Z http.https://github.com/.extraheader 2025-12-04T09:45:40.3131039Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T09:45:40.3173263Z http.https://github.com/.extraheader 2025-12-04T09:45:40.3221531Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T09:45:40.3263406Z http.https://github.com/.extraheader 2025-12-04T09:45:40.3309428Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T09:45:40.3351049Z http.https://github.com/.extraheader 2025-12-04T09:45:40.3389875Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T09:45:40.3431154Z http.https://github.com/.extraheader 2025-12-04T09:45:40.3469870Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T09:45:40.3511337Z http.https://github.com/.extraheader 2025-12-04T09:45:40.3550163Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T09:45:40.3588708Z http.https://github.com/.extraheader 2025-12-04T09:45:40.3628604Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T09:45:40.3669416Z http.https://github.com/.extraheader 2025-12-04T09:45:40.3707933Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T09:45:40.3748266Z http.https://github.com/.extraheader 2025-12-04T09:45:40.3787675Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T09:45:40.3827689Z http.https://github.com/.extraheader 2025-12-04T09:45:40.3866291Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T09:45:40.3907110Z http.https://github.com/.extraheader 2025-12-04T09:45:40.3949107Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T09:45:40.3991644Z http.https://github.com/.extraheader 2025-12-04T09:45:40.4033501Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T09:45:40.4073696Z http.https://github.com/.extraheader 2025-12-04T09:45:40.4132999Z Entering 'third_party/pocketfft' 2025-12-04T09:45:40.4173964Z http.https://github.com/.extraheader 2025-12-04T09:45:40.4214770Z Entering 'third_party/protobuf' 2025-12-04T09:45:40.4256508Z http.https://github.com/.extraheader 2025-12-04T09:45:40.4302759Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T09:45:40.4343120Z http.https://github.com/.extraheader 2025-12-04T09:45:40.4387789Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T09:45:40.4428221Z http.https://github.com/.extraheader 2025-12-04T09:45:40.4470857Z Entering 'third_party/psimd' 2025-12-04T09:45:40.4512349Z http.https://github.com/.extraheader 2025-12-04T09:45:40.4549738Z Entering 'third_party/pthreadpool' 2025-12-04T09:45:40.4593243Z http.https://github.com/.extraheader 2025-12-04T09:45:40.4634130Z Entering 'third_party/pybind11' 2025-12-04T09:45:40.4681820Z http.https://github.com/.extraheader 2025-12-04T09:45:40.4723193Z Entering 'third_party/python-peachpy' 2025-12-04T09:45:40.4765084Z http.https://github.com/.extraheader 2025-12-04T09:45:40.4805140Z Entering 'third_party/sleef' 2025-12-04T09:45:40.4847241Z http.https://github.com/.extraheader 2025-12-04T09:45:40.4887270Z Entering 'third_party/tensorpipe' 2025-12-04T09:45:40.4933773Z http.https://github.com/.extraheader 2025-12-04T09:45:40.4977434Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T09:45:40.5023292Z http.https://github.com/.extraheader 2025-12-04T09:45:40.5064421Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T09:45:40.5104413Z http.https://github.com/.extraheader 2025-12-04T09:45:40.5148192Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T09:45:40.5188521Z http.https://github.com/.extraheader 2025-12-04T09:45:40.5227775Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T09:45:40.5267206Z http.https://github.com/.extraheader 2025-12-04T09:45:40.5307847Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T09:45:40.5348214Z http.https://github.com/.extraheader 2025-12-04T09:45:40.5413140Z [command]/usr/bin/git config --local --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:40.5444405Z [command]/usr/bin/git submodule foreach --recursive git config --local --show-origin --name-only --get-regexp remote.origin.url 2025-12-04T09:45:40.5780954Z Entering 'android/libs/fbjni' 2025-12-04T09:45:40.5807945Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/android/libs/fbjni/config remote.origin.url 2025-12-04T09:45:40.5827629Z Entering 'third_party/FP16' 2025-12-04T09:45:40.5856534Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FP16/config remote.origin.url 2025-12-04T09:45:40.5875729Z Entering 'third_party/FXdiv' 2025-12-04T09:45:40.5904236Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FXdiv/config remote.origin.url 2025-12-04T09:45:40.5924375Z Entering 'third_party/NNPACK' 2025-12-04T09:45:40.5952971Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK/config remote.origin.url 2025-12-04T09:45:40.5972012Z Entering 'third_party/NVTX' 2025-12-04T09:45:40.6002637Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NVTX/config remote.origin.url 2025-12-04T09:45:40.6024523Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T09:45:40.6056653Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/VulkanMemoryAllocator/config remote.origin.url 2025-12-04T09:45:40.6076845Z Entering 'third_party/XNNPACK' 2025-12-04T09:45:40.6107861Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/XNNPACK/config remote.origin.url 2025-12-04T09:45:40.6141749Z Entering 'third_party/aiter' 2025-12-04T09:45:40.6169722Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/aiter/config remote.origin.url 2025-12-04T09:45:40.6188910Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T09:45:40.6215876Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/aiter/modules/3rdparty/composable_kernel/config remote.origin.url 2025-12-04T09:45:40.6244783Z Entering 'third_party/benchmark' 2025-12-04T09:45:40.6273069Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/benchmark/config remote.origin.url 2025-12-04T09:45:40.6293976Z Entering 'third_party/composable_kernel' 2025-12-04T09:45:40.6322548Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/composable_kernel/config remote.origin.url 2025-12-04T09:45:40.6348574Z Entering 'third_party/cpp-httplib' 2025-12-04T09:45:40.6374782Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/cpp-httplib/config remote.origin.url 2025-12-04T09:45:40.6394721Z Entering 'third_party/cpuinfo' 2025-12-04T09:45:40.6423314Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/cpuinfo/config remote.origin.url 2025-12-04T09:45:40.6443789Z Entering 'third_party/cudnn_frontend' 2025-12-04T09:45:40.6469934Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/cudnn_frontend/config remote.origin.url 2025-12-04T09:45:40.6489397Z Entering 'third_party/cutlass' 2025-12-04T09:45:40.6517046Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/cutlass/config remote.origin.url 2025-12-04T09:45:40.6545997Z Entering 'third_party/fbgemm' 2025-12-04T09:45:40.6577377Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/config remote.origin.url 2025-12-04T09:45:40.6598074Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T09:45:40.6625512Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/asmjit/config remote.origin.url 2025-12-04T09:45:40.6645643Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T09:45:40.6672181Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/composable_kernel/config remote.origin.url 2025-12-04T09:45:40.6700069Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T09:45:40.6726612Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/cpuinfo/config remote.origin.url 2025-12-04T09:45:40.6746670Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T09:45:40.6779335Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/cutlass/config remote.origin.url 2025-12-04T09:45:40.6806916Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T09:45:40.6839044Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/googletest/config remote.origin.url 2025-12-04T09:45:40.6858153Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T09:45:40.6887100Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/hipify_torch/config remote.origin.url 2025-12-04T09:45:40.6905994Z Entering 'third_party/fbgemm/external/json' 2025-12-04T09:45:40.6938113Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/json/config remote.origin.url 2025-12-04T09:45:40.6961259Z Entering 'third_party/flash-attention' 2025-12-04T09:45:40.6989518Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/config remote.origin.url 2025-12-04T09:45:40.7008437Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T09:45:40.7035677Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/modules/csrc/composable_kernel/config remote.origin.url 2025-12-04T09:45:40.7060767Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T09:45:40.7088205Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/modules/csrc/cutlass/config remote.origin.url 2025-12-04T09:45:40.7116380Z Entering 'third_party/flatbuffers' 2025-12-04T09:45:40.7148708Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/flatbuffers/config remote.origin.url 2025-12-04T09:45:40.7171301Z Entering 'third_party/fmt' 2025-12-04T09:45:40.7198543Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fmt/config remote.origin.url 2025-12-04T09:45:40.7218097Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T09:45:40.7246406Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/gemmlowp/gemmlowp/config remote.origin.url 2025-12-04T09:45:40.7266767Z Entering 'third_party/gloo' 2025-12-04T09:45:40.7294903Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/gloo/config remote.origin.url 2025-12-04T09:45:40.7315161Z Entering 'third_party/googletest' 2025-12-04T09:45:40.7343557Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/googletest/config remote.origin.url 2025-12-04T09:45:40.7364389Z Entering 'third_party/ideep' 2025-12-04T09:45:40.7397509Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/config remote.origin.url 2025-12-04T09:45:40.7415820Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T09:45:40.7443497Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/modules/mkl-dnn/config remote.origin.url 2025-12-04T09:45:40.7473413Z Entering 'third_party/ittapi' 2025-12-04T09:45:40.7501959Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/ittapi/config remote.origin.url 2025-12-04T09:45:40.7525876Z Entering 'third_party/kineto' 2025-12-04T09:45:40.7557917Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/config remote.origin.url 2025-12-04T09:45:40.7576859Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T09:45:40.7604383Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/config remote.origin.url 2025-12-04T09:45:40.7623446Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T09:45:40.7649671Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/DCGM/config remote.origin.url 2025-12-04T09:45:40.7670659Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T09:45:40.7698253Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/cpr/config remote.origin.url 2025-12-04T09:45:40.7717380Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T09:45:40.7749833Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/fmt/config remote.origin.url 2025-12-04T09:45:40.7769726Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T09:45:40.7797218Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/config remote.origin.url 2025-12-04T09:45:40.7815008Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T09:45:40.7841875Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/modules/doc/config remote.origin.url 2025-12-04T09:45:40.7863603Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T09:45:40.7892121Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/glog/config remote.origin.url 2025-12-04T09:45:40.7911065Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T09:45:40.7938671Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/googletest/config remote.origin.url 2025-12-04T09:45:40.7958974Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T09:45:40.7988028Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/json/config remote.origin.url 2025-12-04T09:45:40.8008522Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T09:45:40.8035901Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/pfs/config remote.origin.url 2025-12-04T09:45:40.8056179Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T09:45:40.8082721Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/config remote.origin.url 2025-12-04T09:45:40.8105495Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T09:45:40.8133565Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/modules/civetweb/config remote.origin.url 2025-12-04T09:45:40.8156209Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T09:45:40.8183887Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/modules/googletest/config remote.origin.url 2025-12-04T09:45:40.8211651Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T09:45:40.8240944Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/fmt/config remote.origin.url 2025-12-04T09:45:40.8260051Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T09:45:40.8288034Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/googletest/config remote.origin.url 2025-12-04T09:45:40.8310734Z Entering 'third_party/kleidiai' 2025-12-04T09:45:40.8340853Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kleidiai/config remote.origin.url 2025-12-04T09:45:40.8361154Z Entering 'third_party/mimalloc' 2025-12-04T09:45:40.8388715Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/mimalloc/config remote.origin.url 2025-12-04T09:45:40.8408337Z Entering 'third_party/nlohmann' 2025-12-04T09:45:40.8435881Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/nlohmann/config remote.origin.url 2025-12-04T09:45:40.8457449Z Entering 'third_party/onnx' 2025-12-04T09:45:40.8483967Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/config remote.origin.url 2025-12-04T09:45:40.8517902Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T09:45:40.8545729Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/modules/third_party/pybind11/config remote.origin.url 2025-12-04T09:45:40.8569543Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T09:45:40.8597706Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/config remote.origin.url 2025-12-04T09:45:40.8617660Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T09:45:40.8644514Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/benchmark/config remote.origin.url 2025-12-04T09:45:40.8665737Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T09:45:40.8693499Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/googletest/config remote.origin.url 2025-12-04T09:45:40.8713492Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T09:45:40.8739567Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/ms-gsl/config remote.origin.url 2025-12-04T09:45:40.8758363Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T09:45:40.8785335Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/nlohmann-json/config remote.origin.url 2025-12-04T09:45:40.8806006Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T09:45:40.8833877Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/opentelemetry-proto/config remote.origin.url 2025-12-04T09:45:40.8853481Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T09:45:40.8878928Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/opentracing-cpp/config remote.origin.url 2025-12-04T09:45:40.8898266Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T09:45:40.8925676Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/config remote.origin.url 2025-12-04T09:45:40.8943453Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T09:45:40.8970264Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/modules/civetweb/config remote.origin.url 2025-12-04T09:45:40.8989797Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T09:45:40.9017524Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/modules/googletest/config remote.origin.url 2025-12-04T09:45:40.9041388Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T09:45:40.9070933Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/tools/vcpkg/config remote.origin.url 2025-12-04T09:45:40.9107600Z Entering 'third_party/pocketfft' 2025-12-04T09:45:40.9136028Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/pocketfft/config remote.origin.url 2025-12-04T09:45:40.9155828Z Entering 'third_party/protobuf' 2025-12-04T09:45:40.9182586Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/config remote.origin.url 2025-12-04T09:45:40.9208071Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T09:45:40.9235410Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/benchmark/config remote.origin.url 2025-12-04T09:45:40.9254805Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T09:45:40.9280420Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/googletest/config remote.origin.url 2025-12-04T09:45:40.9301989Z Entering 'third_party/psimd' 2025-12-04T09:45:40.9329397Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/psimd/config remote.origin.url 2025-12-04T09:45:40.9349680Z Entering 'third_party/pthreadpool' 2025-12-04T09:45:40.9377363Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/pthreadpool/config remote.origin.url 2025-12-04T09:45:40.9396800Z Entering 'third_party/pybind11' 2025-12-04T09:45:40.9429370Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/pybind11/config remote.origin.url 2025-12-04T09:45:40.9449045Z Entering 'third_party/python-peachpy' 2025-12-04T09:45:40.9475703Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/python-peachpy/config remote.origin.url 2025-12-04T09:45:40.9495521Z Entering 'third_party/sleef' 2025-12-04T09:45:40.9523811Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/sleef/config remote.origin.url 2025-12-04T09:45:40.9544373Z Entering 'third_party/tensorpipe' 2025-12-04T09:45:40.9570566Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/config remote.origin.url 2025-12-04T09:45:40.9588629Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T09:45:40.9615637Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/googletest/config remote.origin.url 2025-12-04T09:45:40.9635201Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T09:45:40.9660998Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libnop/config remote.origin.url 2025-12-04T09:45:40.9679785Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T09:45:40.9706483Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libuv/config remote.origin.url 2025-12-04T09:45:40.9725980Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T09:45:40.9751689Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/config remote.origin.url 2025-12-04T09:45:40.9769764Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T09:45:40.9795708Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/modules/tools/clang/config remote.origin.url 2025-12-04T09:45:40.9840347Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/android/libs/fbjni/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:40.9875766Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FP16/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:40.9902024Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FXdiv/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:40.9928855Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:40.9954989Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NVTX/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:40.9981958Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/VulkanMemoryAllocator/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.0009679Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/XNNPACK/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.0034609Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/aiter/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.0060805Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/aiter/modules/3rdparty/composable_kernel/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.0086420Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/benchmark/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.0112997Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/composable_kernel/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.0140653Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/cpp-httplib/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.0166409Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/cpuinfo/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.0192261Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/cudnn_frontend/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.0219247Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/cutlass/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.0243575Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.0268531Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/asmjit/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.0294205Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/composable_kernel/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.0320548Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/cpuinfo/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.0345595Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/cutlass/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.0371777Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.0397564Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/hipify_torch/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.0428718Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/json/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.0452524Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.0477697Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/modules/csrc/composable_kernel/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.0502621Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/modules/csrc/cutlass/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.0528866Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/flatbuffers/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.0553541Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fmt/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.0581299Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/gemmlowp/gemmlowp/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.0605478Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/gloo/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.0630246Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.0655289Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.0681386Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/modules/mkl-dnn/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.0707512Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/ittapi/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.0733734Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.0758951Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.0785104Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/DCGM/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.0811522Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/cpr/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.0850515Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/fmt/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.0866757Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.0893276Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/modules/doc/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.0919984Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/glog/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.0949035Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.0975508Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/json/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.0999377Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/pfs/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.1023602Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.1049000Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/modules/civetweb/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.1073351Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/modules/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.1098359Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/fmt/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.1122181Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.1147277Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kleidiai/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.1172872Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/mimalloc/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.1199447Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/nlohmann/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.1223918Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.1250044Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/modules/third_party/pybind11/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.1275356Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.1300721Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/benchmark/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.1325874Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.1350218Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/ms-gsl/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.1375341Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/nlohmann-json/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.1400832Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/opentelemetry-proto/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.1427735Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/opentracing-cpp/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.1451723Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.1484200Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/modules/civetweb/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.1508475Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/modules/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.1532341Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/tools/vcpkg/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.1558802Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/pocketfft/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.1582919Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.1608582Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/benchmark/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.1631992Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.1655781Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/psimd/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.1681869Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/pthreadpool/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.1709999Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/pybind11/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.1734911Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/python-peachpy/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.1759018Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/sleef/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.1783228Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.1807663Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.1830646Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libnop/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.1854619Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libuv/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.1883100Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.1909887Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/modules/tools/clang/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:45:41.2025790Z A job completed hook has been configured by the self-hosted runner administrator 2025-12-04T09:45:41.2043035Z ##[group]Run '/home/ec2-user/runner-scripts/after_job.sh' 2025-12-04T09:45:41.2049771Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:45:41.2050052Z ##[endgroup] 2025-12-04T09:45:41.2142665Z [!ALERT!] Swap in detected! [!ALERT!] 2025-12-04T09:45:50.7466790Z [!ALERT!] Swap out detected [!ALERT!] 2025-12-04T09:46:07.2831125Z Cleaning up orphan processes